
Text Data Format automatic delimiter

asked 2017-10-10 11:28:35 -0600

amandra

updated 2017-10-19 13:21:57 -0600

LC

I'm exploring the performance of StreamSets. I'm importing log files from an S3 origin that need to be converted by a custom processor. Example log:

{@BATCH|8495c_zp|7065 / 7066 / 7067 / 7068|4055|1||btest|170414104111||HP20|8495c_zp|RevA||| {@BTEST|default_SN1|82|170414104238|000005|0|all||n||170414104243||1 {@BLOCK|power|00 {@A-MEA|0|+4.805986E+01|+48V_P{@LIM2|... {@A-MEA|0|+4.802368E+01|+48V_BIAS{@LI... } }}

When I select the Text data format, the S3 stage automatically splits each line into a separate record, which is undesirable for the parser processor. The only way I found to stop this behavior was to use a Custom Delimiter of NULL. However, I noticed that with that delimiter the stage no longer batches and is significantly slower. Is there a way to have the data come in as unmolested text that can still batch?


1 Answer


answered 2017-10-10 14:51:53 -0600

metadaddy

You can use Whole File data format to skip parsing altogether.

In the Groovy, JavaScript, or Jython Evaluator, you can use the getInputStream() API to access the data directly.

If you're writing a custom processor using the Java API, call createInputStream() on the FileRef class to do the same. See the custom processor tutorial for an example.
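For the Java route, the part that matters for this log format is reading the stream in one piece so the line breaks are never treated as record boundaries. Here is a minimal sketch of that step using only the JDK; the class name `WholeFileText`, the method `readAll`, and the UTF-8 charset are assumptions for illustration, and the StreamSets calls appear only in comments as recalled from the tutorial, so check them against your SDC version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the StreamSets API.
final class WholeFileText {

    // Reads the whole stream into a single String, leaving newlines intact.
    // Assumes the log files are UTF-8 (or plain ASCII).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Inside a processor, the stream would come from the whole-file record,
    // roughly like this (field path and signature are assumptions to verify):
    //   FileRef ref = record.get("/fileRef").getValueAsFileRef();
    //   try (InputStream in = ref.createInputStream(getContext(), InputStream.class)) {
    //       String text = WholeFileText.readAll(in);
    //   }
}
```

This buffers the entire file in memory, which is fine for small log files but worth rethinking if the files get large.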




That was my initial attempt, but it seemed slow compared to the Text format. I just converted the processor to handle both types so I can compare the performance, since there was plenty of room for error in those first attempts.

amandra (2017-10-10 15:31:07 -0600)

How did you write your processor? Java? Groovy?

metadaddy (2017-10-10 16:23:24 -0600)

The processor is in Java with a JNI component. After testing with a processor that can handle both whole-file and text input, I found that neither will batch (1 record per batch), and that text runs at about 5 records/s and whole files at about 2/s. Sending the output straight to Trash also runs at about 5/s, with no batching.

amandra (2017-10-10 16:33:35 -0600)