How do I make StreamSets run in a clustered environment?

We have added StreamSets as a parcel to our Cloudera cluster and deployed it to two nodes, but each StreamSets node acts like a separate instance. I found the answer below in the FAQ, which says it can run in a clustered environment:

Does the Data Collector run in a clustered environment? Yes, Data Collector utilizes your existing YARN and Spark Streaming implementation to spawn additional workers as needed for scalability.

But as I understand it, that doesn't address high availability for StreamSets. If I start a pipeline from instance-a, I can request more workers, but what happens if instance-a goes down for some reason? I cannot see the pipeline on instance-b, and I'm not sure whether this is a configuration setting somewhere. What is the use of deploying StreamSets to different hosts (other than as a standby)? I'm not sure whether I missed anything. I followed the steps as mentioned in the link.

I tried the JDBC origin, and it came up with the warning. But what I would like to achieve is to see the same pipelines when I log in from different StreamSets nodes, so that if one of my nodes goes down I still have one instance of StreamSets (high availability).

Which origin are you using? Cluster batch mode works only for Hadoop FS / MapR-FS origins; cluster streaming mode works only for Kafka / MapR Streams origins. See

@metadaddy It would be helpful if there were a tutorial on setting up a cluster pipeline and running example data processing.

1 Answer

StreamSets Control Hub manages multiple Data Collector instances and can fail over jobs from one instance to another if an instance fails. Check out the product page and feel free to request a demo!

Is StreamSets Control Hub included in the Enterprise Platform?

