StreamSets Data Collector Cluster Mode

asked 2017-11-24 01:32:00 -0500

Vivian Y

updated 2017-11-29 12:57:55 -0500

metadaddy

How does StreamSets cluster mode work? Does clustering mode actually refer to the pipeline, or to the services we use to get or send data, such as Kafka or HDFS?

1 Answer

answered 2017-11-27 12:20:35 -0500

metadaddy

In Cluster Mode, StreamSets Data Collector (SDC) runs via one of the following mechanisms:

  • In Cluster Batch mode, SDC runs as a map-only application on MapReduce, on top of YARN. When you start the pipeline, the standalone SDC instance in which you are working bundles up the necessary JAR files and submits the job to YARN. YARN and MapReduce create one task for each HDFS / MapR FS block, so SDC ends up running on many nodes in the cluster.
  • In Cluster Streaming mode, SDC runs as an application within Spark Streaming, using either YARN or Mesos as the cluster manager. The cluster manager and Spark Streaming spawn an SDC worker for each topic partition in the Kafka cluster, so each partition has an SDC worker processing its data.
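
To make the parallelism in the two modes above concrete, here is a rough back-of-the-envelope sketch in Python. It is not SDC code and calls no StreamSets API; the function names, the 128 MiB block size, and the sample file sizes and topic names are all assumptions made up for illustration. It only applies the two rules stated above: one task per block in Cluster Batch mode, one worker per topic partition in Cluster Streaming mode.

```python
# Illustrative only: estimates how many SDC workers each cluster mode would
# end up with, based on the "one per block" / "one per partition" rules.

# Assumption: 128 MiB blocks; the actual block size depends on cluster config.
ASSUMED_BLOCK_SIZE = 128 * 1024 * 1024


def batch_task_count(file_sizes, block_size=ASSUMED_BLOCK_SIZE):
    """Cluster Batch: one map task per block, summed over all input files."""
    # Integer ceiling division: a partially filled last block still counts.
    return sum((size + block_size - 1) // block_size for size in file_sizes)


def streaming_worker_count(partitions_per_topic):
    """Cluster Streaming: one SDC worker per Kafka topic partition."""
    return sum(partitions_per_topic.values())


if __name__ == "__main__":
    mib = 1024 * 1024
    # Hypothetical input: a 300 MiB file and a 128 MiB file.
    print(batch_task_count([300 * mib, 128 * mib]))
    # Hypothetical topics with 6 and 4 partitions.
    print(streaming_worker_count({"orders": 6, "clicks": 4}))
```

In other words, in either mode the degree of parallelism is driven by how the data is laid out in the cluster (blocks or partitions), not by a setting on the pipeline itself.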

So the answer to your second question is 'both': the pipeline runs on the cluster, and it uses cluster-specific services such as HDFS and Kafka partitions.

See the Cluster Mode documentation for more info.

