
In cluster mode, StreamSets Data Collector (SDC) runs via one of the following mechanisms:

  • In cluster batch mode, SDC runs as a map-only application on MapReduce, on top of YARN. When you start the pipeline, the standalone SDC instance in which you are working bundles up the necessary jar files and submits the job to YARN. YARN and MapReduce create one task for each HDFS / MapR FS block, so SDC ends up running on many nodes in the cluster.
  • In cluster streaming mode, SDC runs as an application within Spark Streaming, using either YARN or Mesos as the cluster manager. The cluster manager and Spark Streaming spawn one SDC worker for each partition of the Kafka topic, so every partition has its own SDC worker processing data.

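To make the parallelism model above concrete, here is a toy Python sketch (this is not SDC's actual API; the function and field names are invented for illustration) of how both cluster modes map units of work to workers: one map task per HDFS block in batch mode, one Spark worker per Kafka partition in streaming mode.

```python
def plan_cluster_workers(mode, units):
    """Return one worker assignment per unit of parallelism.

    mode  -- 'batch' (units are HDFS / MapR FS blocks) or
             'streaming' (units are Kafka topic partitions)
    units -- list of block or partition identifiers
    """
    # In batch mode each worker is a map-only MapReduce task;
    # in streaming mode each worker runs inside a Spark executor.
    role = "map task" if mode == "batch" else "Spark worker"
    return [
        {"worker": f"sdc-worker-{i}", "runs_as": role, "processes": unit}
        for i, unit in enumerate(units)
    ]

# Batch mode: a 3-block HDFS file yields three map-only tasks.
batch_plan = plan_cluster_workers("batch", ["block-0", "block-1", "block-2"])

# Streaming mode: a 2-partition Kafka topic yields two workers.
stream_plan = plan_cluster_workers("streaming", ["topic-p0", "topic-p1"])
```

The key point either way: the degree of parallelism is set by the data layout (block count or partition count), not by anything you configure in the pipeline itself.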
So the answer to your second question is 'both': the pipeline runs on the cluster and uses cluster services such as HDFS and Kafka, with parallelism determined by blocks or partitions.

See the Cluster Mode documentation for more info.