What is the architecture design of StreamSets Data Collector?

asked 2017-12-07 10:32:47 -0600

aman

updated 2017-12-08 13:29:40 -0600

metadaddy

I am still not clear about the architecture, even after going through the tutorials. How do we scale StreamSets in a distributed environment? If the input data velocity from the origin increases, how do we ensure that SDC does not run into performance issues? How many daemons will be running? Is it a master-worker architecture or a peer-to-peer architecture?

If multiple daemons run on multiple machines (e.g. one SDC alongside one NodeManager in YARN), how does it show a centralized view of the data, i.e. total record count etc.?

Also, please let me know the architecture of Dataflow Performance Manager. Which daemons are there in this product?

2 Answers

answered 2018-09-24 07:26:18 -0600

todd

By the way, in the time since this question was asked and answered, StreamSets Control Hub has been released. For anyone reading this now, Control Hub can orchestrate multiple Data Collectors, either directly or through Kubernetes. Consider it when you need horizontal scale and failover similar to existing distributed cluster frameworks.

answered 2017-12-08 13:29:31 -0600

metadaddy

StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases, this can be done automatically. For example, Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application, running as many pipeline instances as there are Kafka partitions.
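To make the partitioning idea concrete, here is a minimal Python sketch. It is not SDC code and does not use any StreamSets API; `NUM_PARTITIONS`, `partition_for`, `split_into_partitions` and `run_pipeline_instance` are hypothetical names chosen for illustration. It only shows the general pattern: records are split into partitions, one pipeline instance handles each partition, and the per-instance counts can be summed afterwards.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumption: stands in for e.g. the number of Kafka partitions


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a record key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_into_partitions(records, num_partitions=NUM_PARTITIONS):
    """Group records by the partition their key hashes to."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_for(record["key"], num_partitions)].append(record)
    return partitions


def run_pipeline_instance(partition_id, records):
    """Hypothetical stand-in for one pipeline instance reading one partition."""
    return {"partition": partition_id, "records_processed": len(records)}


# Illustrative input data only.
records = [{"key": f"user-{i}", "value": i} for i in range(1000)]
partitions = split_into_partitions(records)

# One pipeline instance per partition; parallelism is bounded by partition count.
results = [run_pipeline_instance(pid, recs) for pid, recs in sorted(partitions.items())]
total = sum(r["records_processed"] for r in results)
```

In this sketch, adding capacity means adding partitions, because no more instances than partitions can do useful work.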

In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
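As a rough illustration of the multithreaded pattern, the Python sketch below runs one worker per input table in a thread pool and sums the record counts. It is only an analogy under stated assumptions: `process_table`, `run_multithreaded` and the table names are hypothetical, and nothing here reflects how SDC is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def process_table(table_name, rows):
    """Hypothetical pipeline instance: consume every row of one table."""
    processed = 0
    for _row in rows:
        processed += 1  # a real pipeline would transform and write the record here
    return table_name, processed


def run_multithreaded(tables, max_threads=4):
    """Process each table in its own thread and return per-table and total counts."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(process_table, name, rows) for name, rows in tables.items()]
        per_table = dict(future.result() for future in futures)
    return per_table, sum(per_table.values())


# Illustrative tables only; range() stands in for rows read from a database.
tables = {"orders": range(500), "customers": range(200), "items": range(300)}
per_table, total = run_multithreaded(tables)
```

Here the thread count caps how many tables are read concurrently, much as a thread-count setting on a multithreaded origin would.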

In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.


Does this mean that StreamSets is not distributed? I deployed StreamSets to two nodes on my Cloudera cluster, and the two nodes are acting as separate instances. I couldn't find documentation on how to make it a cluster. Could you throw some light on this or point me to the right documentation?

krishnaM (2017-12-20 19:53:37 -0600)