Joining multiple CSVs using Hive

I have exported a few tables from PostgreSQL to S3 using the JDBC Multitable Consumer origin. In a subsequent pipeline I want to join those CSVs and create flattened tables. Normally I would do this by creating Hive external tables on the S3 folders and joining them. How can I achieve the same in StreamSets? Please help. I have been trying different things for two days but have not been able to move forward.

Thanks, Aditya

1 Answer

I don't think this is possible with StreamSets Data Collector. Data Collector is designed for record-by-record ingest, rather than joining multiple data sets.

One way to do this, if your data is all in the same PostgreSQL database, is to use the JDBC Query Consumer origin and write a SQL query that joins the data in PostgreSQL.
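As a rough illustration of the kind of join query meant here, the sketch below builds two small tables and flattens them with a single SELECT. The table and column names (customers, orders, customer_id and so on) are made up for the example, and sqlite3 is used only as a stand-in for PostgreSQL so that the snippet runs anywhere. The SQL in JOIN_QUERY is the part you would adapt for the origin; any origin-specific requirements such as offset handling are not shown.

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# One row per order, with the customer columns flattened in.
JOIN_QUERY = """
    SELECT o.order_id, c.customer_id, c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
"""

flattened_rows = conn.execute(JOIN_QUERY).fetchall()
for row in flattened_rows:
    print(row)
```

Doing the join in the database means the pipeline only ever sees already-flattened records, which fits the record-by-record model.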

Another option is to use a tool such as AWS Glue to work with the data in S3. There are examples that cover joining CSV data.
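To make concrete what such a join over the exported CSVs computes, here is a minimal local sketch using only the Python standard library. It is not the AWS Glue API, and it only suits files small enough to fit in memory. The file contents, the column names and the join_csv helper are all hypothetical.

```python
import csv
import io


def join_csv(left_file, right_file, key):
    """Inner-join two CSV streams on `key` using an in-memory index (hypothetical helper)."""
    # Index the right-hand rows by join key.
    right_index = {}
    for row in csv.DictReader(right_file):
        right_index.setdefault(row[key], []).append(row)

    # Stream the left-hand rows and emit one merged row per match.
    joined = []
    for left_row in csv.DictReader(left_file):
        for right_row in right_index.get(left_row[key], []):
            joined.append({**right_row, **left_row})
    return joined


# Inline sample data standing in for the CSV files exported to S3.
orders_csv = io.StringIO("order_id,customer_id,amount\n10,1,25.0\n11,2,15.5\n12,3,9.0\n")
customers_csv = io.StringIO("customer_id,name\n1,Asha\n2,Ben\n")

flat = join_csv(orders_csv, customers_csv, "customer_id")
for row in flat:
    print(row)
```

Because this is an inner join, the order whose customer_id has no match in the customers data is dropped. All values stay as strings, since the csv module does no type conversion.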

