Joining multiple CSVs using Hive

asked 2019-06-07 04:33:07 -0500

Aditya

I have exported a few tables from PostgreSQL to S3 using the JDBC Multitable Consumer origin. In a subsequent pipeline, I want to join those CSVs and create flattened tables. Normally I would do this by creating Hive external tables on the S3 folders and joining them. How can I achieve the same in StreamSets? Please help. I have been trying different approaches for two days without making any progress.

Thanks, Aditya


1 Answer


answered 2019-06-10 10:13:51 -0500

metadaddy

I don't think this is possible with StreamSets Data Collector. Data Collector is designed for record-by-record ingest, rather than joining multiple data sets.

One way to do this, if your data is all in the same PostgreSQL database, is to use the JDBC Query Consumer origin and write a SQL query that joins the data in PostgreSQL.
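As a rough illustration of the kind of join query meant here, the sketch below runs one against an in-memory SQLite database standing in for PostgreSQL. The `customers` and `orders` tables and their columns are hypothetical, and any offset handling the origin may need in its query is left out; only the `JOIN_QUERY` string is the part that would be adapted for the origin.

```python
import sqlite3

# Hypothetical join; table and column names are assumptions for illustration.
JOIN_QUERY = """
SELECT o.order_id, c.name AS customer_name, o.amount
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
ORDER BY o.order_id
"""


def run_join_demo():
    """Build two small tables in SQLite and return the flattened join result."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real PostgreSQL database
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.5);
    """)
    rows = conn.execute(JOIN_QUERY).fetchall()
    conn.close()
    return rows


flattened_rows = run_join_demo()
```

Because the database does the join, each row the pipeline reads is already flattened, which fits the record-by-record model.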

Another option is to use a tool such as AWS Glue to work with the data in S3. There are examples that cover joining CSV data.
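To make the join step itself concrete, here is a minimal sketch in plain Python, not Glue code. It inner-joins two CSV inputs on a shared key column using only the standard library. The file contents, column names, and the `join_csv` helper are all hypothetical; with real data the inputs would be the exported files from S3 rather than in-memory strings.

```python
import csv
import io


def join_csv(left_file, right_file, key):
    """Inner-join two CSV streams on a shared column, returning flattened dicts."""
    # Index the right-hand rows by key so each left row is matched by lookup.
    right_index = {}
    for row in csv.DictReader(right_file):
        right_index.setdefault(row[key], []).append(row)

    joined = []
    for left_row in csv.DictReader(left_file):
        # Left rows without a matching key are dropped (inner join).
        for right_row in right_index.get(left_row[key], []):
            joined.append({**right_row, **left_row})
    return joined


# Hypothetical sample data standing in for two exported tables.
customers_csv = io.StringIO("customer_id,name\n1,Ada\n2,Grace\n")
orders_csv = io.StringIO("order_id,customer_id,amount\n10,1,25.0\n11,2,40.0\n12,3,9.9\n")

flattened = join_csv(orders_csv, customers_csv, "customer_id")
```

This holds one whole input in memory, so it only suits small tables; for larger data, an engine built for joins is the better fit.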

