
How do I convert delimited data to Avro?

asked 2018-07-17 16:30:37 -0500 by jsphar

updated 2018-07-18 09:35:42 -0500 by metadaddy

I am looking for help using the JSON converter, or advice on whether it is even the right converter for what I am trying to do. I have data coming in from a Kafka topic that looks like the following:

XYZ,1,10132977,-121.935583,37.505102,9,12,0,0.0,7/17/18 21:17:24,278.875,-1.160,0.360,9.740,-9.375,5.062,278.875,0.188,-0.750,-0.375,106,16,5,,IDLE;

The data does not have headers, since we usually load it into a database schema whose fields match the exact order the data arrives in.

So, the problem I am trying to solve is: what is the best way to convert this text into JSON so that I can convert and compress it to Avro before stuffing it into S3? If I send a proper JSON message as a test from Kafka, my current pipeline works as intended, but it seems apparent that I need JSON input in order to convert to the Avro schema. Can I use the JSON Generator for this? Is there a way to add headers to the JSON message so the fields match the text data coming in?
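For illustration, the header-matching step described above can be done outside of Data Collector by zipping a list of column names onto each headerless row. This is a minimal sketch, and the field names below are hypothetical placeholders, not the real database schema:

```python
import csv
import io
import json

# Hypothetical field names -- the real list would mirror the database
# schema the delimited data is normally loaded into (25 fields, 0-24).
FIELD_NAMES = ["device", "channel", "unit_id", "longitude", "latitude"]

def delimited_to_json(text):
    """Pair assumed header names with headerless CSV rows,
    emitting one JSON object per row."""
    rows = csv.reader(io.StringIO(text))
    return [json.dumps(dict(zip(FIELD_NAMES, row))) for row in rows]

docs = delimited_to_json("XYZ,1,10132977,-121.935583,37.505102")
```

Each resulting JSON document then has named fields that an Avro schema could be matched against.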

I appreciate any guidance on this topic.



Is the data in the message actually surrounded by quotes?

metadaddy (2018-07-17 20:34:52 -0500)

No. The data is just comma separated text with no quotes. The quotes just show up in the Kafka consumer display. Sorry for the confusion.

jsphar (2018-07-18 08:29:37 -0500)

1 Answer


answered 2018-07-18 11:41:58 -0500 by iamontheinet

updated 2018-07-18 11:47:07 -0500


I don't believe you need to convert the CSV text to JSON before storing it in Avro format on AWS S3. The following pipeline should work just fine:

[screenshot: pipeline overview]

Here are the individual stage settings. (Note: in Field Renamer, be sure to include all 25 fields, from 0 to 24.)

Kafka Consumer

[screenshot: Kafka Consumer settings]

Field Renamer

[screenshot: Field Renamer settings]

Schema Generator

[screenshot: Schema Generator settings]

Amazon S3

[screenshot: Amazon S3 settings]
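The Field Renamer mappings for the 25 positional fields might look something like the sketch below. The target names are hypothetical placeholders (the real ones would mirror your database columns), and the exact expression syntax should be verified against the Field Renamer documentation:

```
Source Field Expression    Target Field Expression
/'0'                       /device
/'1'                       /channel
/'2'                       /unit_id
(and so on through /'24')
```

If typing 25 explicit mappings is tedious, the Field Renamer also supports regex-based source expressions with group substitution in the target, which can rename a whole set of fields in one mapping.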

And here's my output stored on AWS S3:

[screenshot: Avro output files in the S3 bucket]

Hope this helps!

Cheers, Dash



Hi Dash, thanks a lot for the suggestions. Just curious: is there any reason why I don't have the "Include Schema" option under "Data Format" for Amazon S3? I am using version 3.3.0, and the package I have installed is Amazon Web Services 1.11.123 (streamsets-datacollector-aws-lib). Jason

jsphar (2018-07-18 13:49:08 -0500)

Oh, I also wanted to ask whether Data Collector has an option to pool records for a period of time before sending them to S3, more like a batch option, if you will. Thanks again, Jason.

jsphar (2018-07-18 13:50:19 -0500)

You're welcome! I am running a nightly build of the soon-to-be-released 3.4.0. However, unchecking "Include Schema" and rerunning the pipeline didn't show me any visual differences, so I'd suggest running it and then making sure you get the desired output in your S3 bucket.

iamontheinet (2018-07-18 13:56:45 -0500)

I will have a look. Thanks again for your help!

jsphar (2018-07-18 14:48:38 -0500)