StreamSets and its applicability for our specific use case

asked 2019-01-11 03:42:59 -0500

Deepak Rajendra


I am new to StreamSets. I have just begun to explore its potential and I find the data streaming concept in StreamSets requires a big mindset shift for someone like me coming from the Alteryx world.

We are in the process of moving data from Netezza (analytical database) to BDPaaS (our new big data platform). The goal is to decommission Netezza and use BDPaaS as the source for all our analytical needs. Netezza is a high-speed analytical database, but it isn't working well for our data modeling work, hence the move to BDPaaS.

We have a data science edge node with StreamSets set up on it. One of the things I would like to do with StreamSets is:

1) Point StreamSets to Netezza.
2) Instruct StreamSets (using the JDBC Multitable Consumer) to pull down all the tables from a specific database.
3) Copy the tables (probably in Parquet file format) into their respective sub-folders within a specific directory on our BDPaaS tenant space (configured with a MapR cluster).
4) Point Hive to the sub-folders, update the Hive metastore, and use Hive to query the files.
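
As a rough illustration of step 4, here is a minimal sketch of generating one Hive external-table statement per Parquet sub-folder. The database name, table name, column list, and base directory are hypothetical placeholders, and the column types would have to be mapped from the Netezza schema by hand or by another tool. This is plain string building, not a StreamSets feature.

```python
def hive_external_table_ddl(database, table, columns, base_dir):
    """Build a CREATE EXTERNAL TABLE statement for one Parquet sub-folder.

    columns is a list of (column_name, hive_type) pairs.
    base_dir is the parent directory that holds one sub-folder per table.
    """
    column_defs = ",\n  ".join(
        f"`{name}` {hive_type}" for name, hive_type in columns
    )
    # Each table lives in <base_dir>/<table>, matching the layout in step 3.
    location = f"{base_dir.rstrip('/')}/{table}"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_defs}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical example values, not taken from the real environment.
ddl = hive_external_table_ddl(
    database="analytics",
    table="transactions",
    columns=[("txn_id", "BIGINT"), ("amount", "DECIMAL(18,2)")],
    base_dir="/data/netezza_copy/",
)
print(ddl)
```

Because the table is external, dropping it in Hive leaves the Parquet files in place, which seems to fit a layout where another tool owns the files.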

Note: Netezza currently holds billions of transactions (the biggest table so far has 10 billion), so volume will dictate the approach we take.
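
To give a sense of what that volume means for a chunked copy, here is a small sketch of the arithmetic, assuming the large table has a numeric key column that can be split into ranges. The partition size of one million rows and the function names are hypothetical, and this does not represent any particular StreamSets setting.

```python
def partition_count(row_count, partition_size):
    """Number of partitions needed to cover row_count rows (integer ceiling)."""
    return -(-row_count // partition_size)


def partition_ranges(min_id, max_id, partition_size):
    """Yield half-open (start, end) key ranges covering min_id..max_id inclusive."""
    start = min_id
    while start <= max_id:
        end = min(start + partition_size, max_id + 1)
        yield (start, end)
        start = end


# 10 billion rows in chunks of 1 million rows (hypothetical chunk size).
print(partition_count(10_000_000_000, 1_000_000))

# Small example so the ranges are easy to inspect.
print(list(partition_ranges(1, 10, 4)))
```

Each range could then drive one bounded query of the form `WHERE key >= start AND key < end`, so that a failed chunk can be retried without restarting the whole table.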

I would be interested in getting as many inputs/comments/suggestions as possible from the experts here to make sure we go down the right path. Thank you so much.



Any inputs please? If you require further details, I am happy to provide. Thank you so much.

Deepak Rajendra ( 2019-01-14 01:40:10 -0500 )