How do I make StreamSets run in a clustered environment?

asked 2017-12-21 13:26:02 -0500 by krishnaM

updated 2018-06-01 18:03:20 -0500 by metadaddy

We have added StreamSets as a parcel to our Cloudera cluster and deployed it to two nodes, but each StreamSets node acts as a separate instance. I found the answer below in the FAQ, which says it can run in a clustered environment:

Does the Data Collector run in a clustered environment? Yes, Data Collector utilizes your existing YARN and Spark Streaming implementation to spawn additional workers as needed for scalability.

But as I understand it, this doesn't address high availability of StreamSets. If I start a pipeline from instance-a, I can request more workers, but what happens if instance-a goes down for some reason? I cannot see the pipeline on instance-b, and I'm not sure whether this is a configuration setting somewhere. What is the use of deploying StreamSets to different hosts (other than as a standby)? I'm not sure if I missed anything. I followed the steps in the link https://streamsets.com/documentation/...

I tried the JDBC origin, and it came up with a warning. What I would like to achieve is to see the same pipelines when I log in from different StreamSets nodes, so that if one of my nodes goes down I still have one instance of StreamSets (high availability).


Comments

Which origin are you using? Cluster batch mode works only with the Hadoop FS and MapR-FS origins; cluster streaming mode works only with the Kafka and MapR Streams origins. See https://streamsets.com/documentation/datacollector/latest/help/index.html#Cluster_Mode/ClusterPipelines_title.html#concept_fpz_5r4_vs

metadaddy ( 2017-12-21 17:13:32 -0500 )
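
To make the origin restriction in the comment above concrete, here is a minimal sketch of it as a lookup table. The names `CLUSTER_MODE_BY_ORIGIN` and `cluster_mode_for` are hypothetical and chosen only for illustration; this is not Data Collector code or its API. The table holds just the four origins named in the comment, so an origin such as JDBC comes back as not listed.

```python
from typing import Optional

# Assumption: these origin names are plain strings picked for this sketch,
# not identifiers from the Data Collector API.
CLUSTER_MODE_BY_ORIGIN = {
    "Hadoop FS": "cluster batch",
    "MapR-FS": "cluster batch",
    "Kafka": "cluster streaming",
    "MapR Streams": "cluster streaming",
}


def cluster_mode_for(origin: str) -> Optional[str]:
    """Return the cluster mode listed for this origin, or None if not listed."""
    return CLUSTER_MODE_BY_ORIGIN.get(origin)


if __name__ == "__main__":
    for origin in ("Kafka", "Hadoop FS", "JDBC"):
        mode = cluster_mode_for(origin)
        print(f"{origin}: {mode or 'not listed for cluster mode'}")
```

Note that this only captures which origins pair with which cluster mode. It says nothing about high availability, which is a separate concern from spawning workers.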

@metadaddy It would be helpful to have a tutorial on setting up a cluster pipeline and running an example data processing job.

casel.chen ( 2018-06-06 20:56:13 -0500 )

1 Answer

answered 2018-06-01 18:04:30 -0500 by metadaddy

StreamSets Control Hub manages multiple Data Collector instances and can fail over jobs from one instance to another if an instance fails. Check out the product page and feel free to request a demo!


Comments

Is StreamSets Control Hub included in Enterprise Platform?

casel.chen ( 2018-06-06 20:57:03 -0500 )