
Generate multiple records from single record

asked 2018-08-21 05:13:31 -0500


I have tab-separated input data in which one column holds multiple comma-separated values, in the form:

Field_1\tField_2\tField_3\tField_4\tField_5,Field_6,Field_7\tField_8\tField_9

I want this single record to be converted to multiple records, one per comma-separated value:

Field_1\tField_2\tField_3\tField_4\tField_5\tField_8\tField_9
Field_1\tField_2\tField_3\tField_4\tField_6\tField_8\tField_9
Field_1\tField_2\tField_3\tField_4\tField_7\tField_8\tField_9

Can this be achieved using Groovy Evaluator?


1 Answer


answered 2018-08-21 11:05:16 -0500

metadaddy

You could certainly do this with the Groovy Evaluator, but you don't need to. You can use existing processors to do this. The pipeline will run faster, and it's quite straightforward:

[Screenshot: pipeline with five stages: origin, Data Parser, Field Pivoter, Field Order, destination]

The origin stage is configured to parse tab separated data. I've used the Dev Raw Data Source origin, but you would use whichever origin makes sense for where your data is, with Data Format configured to Delimited tab separated values. Note that you should parse the data into a List-Map root field, since we'll be creating a hierarchy of fields, rather than a simple list.

[Screenshot: origin data format configuration]

This results in a record with the structure:

[Screenshot: record structure after parsing in the origin]

The second stage is a Data Parser processor. It is configured to parse field /4 (the fifth column, since fields are numbered from zero) as CSV:

[Screenshot: Data Parser configuration]

The output from the Data Parser has a hierarchical structure:

[Screenshot: Data Parser output]
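For anyone who finds it easier to read as code, here is a rough Python analogy of the first two stages. It is not StreamSets code and does not reproduce the exact field structure the Data Parser emits; the function names parse_record and parse_field_as_csv are made up for illustration.

```python
import csv

def parse_record(line):
    """Split a tab-separated line into a map keyed by zero-based column index."""
    values = line.rstrip("\r\n").split("\t")
    return {str(i): v for i, v in enumerate(values)}

def parse_field_as_csv(record, key):
    """Return a copy of record where record[key] is the list parsed from its CSV text."""
    parsed = dict(record)
    parsed[key] = next(csv.reader([record[key]]))
    return parsed

line = "Field_1\tField_2\tField_3\tField_4\tField_5,Field_6,Field_7\tField_8\tField_9"
record = parse_field_as_csv(parse_record(line), "4")
print(record["4"])  # ['Field_5', 'Field_6', 'Field_7']
```

The point is only that field 4 stops being a single string and becomes a nested list, which is what makes the pivot in the next stage possible.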

The Field Pivoter is configured to pivot field /4 into multiple records, copying the remaining fields into each:

[Screenshot: Field Pivoter configuration]

The output from the Field Pivoter has the correct fields, but they are no longer in the correct order - new fields are added to the end of the list:

[Screenshot: Field Pivoter output]
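As a hedged illustration of what the pivot does, the Python sketch below emits one record per list item and copies the other fields into each. It mimics the ordering effect described above by placing the pivoted value last; the function name pivot is hypothetical, and this is an analogy rather than how the processor is implemented.

```python
def pivot(record, key):
    """Yield one copy of record per item in record[key], with the item placed last."""
    rest = {k: v for k, v in record.items() if k != key}
    for item in record[key]:
        out = dict(rest)
        out[key] = item  # the key was removed above, so it is re-added at the end
        yield out

record = {
    "0": "Field_1", "1": "Field_2", "2": "Field_3", "3": "Field_4",
    "4": ["Field_5", "Field_6", "Field_7"],
    "5": "Field_8", "6": "Field_9",
}
pivoted = list(pivot(record, "4"))
print(len(pivoted))                # 3
print(list(pivoted[0].values()))
# ['Field_1', 'Field_2', 'Field_3', 'Field_4', 'Field_8', 'Field_9', 'Field_5']
```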

The Field Order processor is configured to reorder the fields as required:

[Screenshot: Field Order configuration]

The output from Field Order is as you might expect:

[Screenshot: Field Order output]
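In plain Python terms, the reordering step amounts to rebuilding the record with its keys in a stated order. A minimal sketch, assuming the index-style field names used above (reorder is a made-up helper, not a StreamSets function):

```python
def reorder(record, order):
    """Return a new record whose fields appear in the given key order."""
    return {key: record[key] for key in order}

# A record with the pivoted field "4" sitting at the end.
pivoted = {
    "0": "Field_1", "1": "Field_2", "2": "Field_3", "3": "Field_4",
    "5": "Field_8", "6": "Field_9", "4": "Field_6",
}
ordered = reorder(pivoted, ["0", "1", "2", "3", "4", "5", "6"])
print(list(ordered.values()))
# ['Field_1', 'Field_2', 'Field_3', 'Field_4', 'Field_6', 'Field_8', 'Field_9']
```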

The destination is configured to write tab separated data.

[Screenshot: destination data format configuration]

The sample pipeline writes to the local filesystem, but you could write to Hadoop FS, Amazon S3, Kafka, or any other destination that supports delimited format.

My output file on disk - cat -t shows tabs as ^I and carriage returns as ^M, so each line ends with a carriage return followed by a newline:

$ cat -t /tmp/out/2018-08-21-15/sdc-24e75fba-bd00-42fd-80c3-1f591e200ca6_d284b3e3-dccb-4c88-8328-ef6cf92858c9 
Field_1^IField_2^IField_3^IField_4^IField_5^IField_8^IField_9^M
Field_1^IField_2^IField_3^IField_4^IField_6^IField_8^IField_9^M
Field_1^IField_2^IField_3^IField_4^IField_7^IField_8^IField_9^M
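
If you want to check the delimiters without cat -t, a small Python sketch along these lines renders the same markers. The "\r\n" line ending is an assumption inferred from the ^M at the end of each line above, and both helper names are hypothetical.

```python
def to_tsv_line(values, newline="\r\n"):
    """Join values with tabs; the CRLF default is an assumption based on the ^M above."""
    return "\t".join(values) + newline

def show_control_chars(text):
    """Render tabs as ^I and carriage returns as ^M, leaving real newlines alone."""
    return text.replace("\t", "^I").replace("\r", "^M")

line = to_tsv_line(
    ["Field_1", "Field_2", "Field_3", "Field_4", "Field_5", "Field_8", "Field_9"]
)
print(show_control_chars(line), end="")
# Field_1^IField_2^IField_3^IField_4^IField_5^IField_8^IField_9^M
```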

Comments

Thanks a lot. The approach worked with minor tweaks.

Azkaban153 (2018-08-24 06:07:46 -0500)