Text Data Format automatic delimiter

asked 2017-10-10 11:28:35 -0500

amandra

updated 2017-10-19 13:21:57 -0500

LC

I'm exploring the performance of StreamSets. I'm importing log files from an S3 origin that need to be converted by a custom processor. Example log:

{@BATCH|8495c_zp|7065 / 7066 / 7067 / 7068|4055|1||btest|170414104111||HP20|8495c_zp|RevA||| {@BTEST|default_SN1|82|170414104238|000005|0|all||n||170414104243||1 {@BLOCK|power|00 {@A-MEA|0|+4.805986E+01|+48V_P{@LIM2|... {@A-MEA|0|+4.802368E+01|+48V_BIAS{@LI... } }}

When I select the Text data format, the S3 stage automatically splits each line into a separate record, which is undesirable for the parser processor. The only way I found to stop this behavior was to use a Custom Delimiter of NULL. However, I noticed that with that delimiter the stage no longer batches and is significantly slower. Is there a way to have the data come in as unmodified text that can still be batched?

1 Answer

answered 2017-10-10 14:51:53 -0500

metadaddy

You can use the Whole File data format to skip parsing altogether.

In the Groovy, JavaScript, or Jython Evaluator, you can use the getInputStream() API to access the data directly.

If you're writing a custom processor using the Java API, call createInputStream() on the FileRef class to do the same. See the custom processor tutorial for an example.
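As a rough illustration of the Java route, the sketch below reads an entire input stream into one string with no line splitting. It is not StreamSets code: the class name `WholeFileText`, the helper `readAll`, the UTF-8 charset, and the shortened sample log are all assumptions made for the example. In a real processor the stream would come from the `createInputStream()` call mentioned above; here a `ByteArrayInputStream` stands in for it so the snippet runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the StreamSets API.
public class WholeFileText {

    // Reads every byte from the stream and decodes it as UTF-8 (assumed charset).
    // Newlines are preserved, so the nested {@...} blocks stay in one piece.
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Shortened, made-up sample in the style of the log from the question.
        String sample = "{@BATCH|8495c_zp\n{@BTEST|default_SN1\n}}";
        // Stand-in for the stream a processor would obtain from the whole-file record.
        try (InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8))) {
            String text = readAll(in);
            System.out.println("Read " + text.length() + " characters as a single string");
        }
    }
}
```

The resulting string can then be handed to the parser in one call, which avoids reassembling line-delimited records. Note that this buffers the whole file in memory, so it only suits files of moderate size.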

Comments

That was my initial attempt, but it seemed slow compared to the Text format. I just converted the processor to handle both types so I can compare their performance, since there was plenty of room for error in those first attempts.

amandra (2017-10-10 15:31:07 -0500)

How did you write your processor? Java? Groovy?

metadaddy (2017-10-10 16:23:24 -0500)

The processor is in Java with a JNI component. After testing with a processor that can handle both whole files and text, I found that neither will batch (1 record per batch), that text runs at about 5/s, and that whole files run at about 2/s. Trash also operates at about 5/s, with no batching.

amandra (2017-10-10 16:33:35 -0500)