How can we set up an SDC multi-node cluster?

asked 2018-04-05 03:43:56 -0500

LiveMore

updated 2018-04-05 11:24:07 -0500

metadaddy

I went through all the cluster-related information in the documentation and the posts in this forum and on Stack Overflow, but nowhere was it clear whether it is possible to set up a StreamSets cluster.

A couple of questions:

  1. Does SDC need an external Hadoop cluster to launch MapReduce jobs [assume we are not using any Hadoop distribution]?
  2. Does SDC need an external Spark cluster to launch streaming jobs [assume we are not using any Hadoop distribution]?
  3. How does it work for this use case? We receive about 1000 files a day from different upstream systems [via scp], in parallel, and they are relatively large, say 1 GB to 10 GB each. We have to apply some transformations to all of those files; later we do some joining and aggregation [we are separating this task out of SDC]. Does this require a single large machine with many cores and a lot of memory? Or is it possible to set up a cluster of SDC instances, as NiFi does?
  4. If we run this SDC instance on a single node for the above case, what happens if the node crashes for some reason?

1 Answer

answered 2018-04-05 11:32:28 -0500

metadaddy

  1. Yes - Data Collector submits MR jobs to an existing YARN cluster.
  2. Yes - Data Collector submits Spark jobs to an existing YARN or Mesos cluster.
  3. You could set up a Hadoop cluster for this, or enable multi-threading in the Directory origin and do it on a single large machine. You can use StreamSets Control Hub to manage a cluster of Data Collector instances, but you would need some strategy for partitioning the input data - for example, writing the files to different directories, each Data Collector reading from its own directory.
  4. Data Collector tracks the offset as it reads files. If it crashes while reading a file it will load the last offset it saved and continue from there.
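
The partitioning strategy in point 3 could be sketched roughly as follows. This is only an illustration, not part of StreamSets: the function names, the `sdc-<n>` directory naming, and the choice of a CRC32 hash are all assumptions. The idea is that whatever process receives the files assigns each one to a stable per-instance directory, and each Data Collector reads only its own directory.

```python
import zlib
from pathlib import Path


def partition_for(filename: str, num_partitions: int) -> int:
    """Map a file name to a stable partition index in [0, num_partitions)."""
    # crc32 gives the same value on every run and machine, unlike the
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(filename.encode("utf-8")) % num_partitions


def target_dir(base: Path, filename: str, num_partitions: int) -> Path:
    """Return the (hypothetical) directory one Data Collector would monitor."""
    return base / f"sdc-{partition_for(filename, num_partitions)}"
```

With three instances, `target_dir(Path("/data/in"), "orders.csv", 3)` always resolves to the same one of `/data/in/sdc-0`, `/data/in/sdc-1`, or `/data/in/sdc-2`, so a retransmitted file lands with the same instance.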

One caveat when using the Directory origin - be sure to move files into the directory as an atomic operation. If you copy files into a directory that Data Collector is monitoring then it may start to read the data before the file is fully written. If you're copying files between machines with scp, you need to copy them to a temporary location on the same filesystem as the directory that Data Collector is monitoring, then use mv to move them.
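
A minimal sketch of that stage-then-move pattern, in Python rather than shell. The function name and directory arguments are hypothetical; the only real requirement, as described above, is that the staging location and the monitored directory share a filesystem, because a rename is atomic only within one filesystem.

```python
import os
import shutil
from pathlib import Path


def deliver_atomically(src: Path, staging_dir: Path, watched_dir: Path) -> Path:
    """Copy src into staging_dir, then atomically rename it into watched_dir."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    watched_dir.mkdir(parents=True, exist_ok=True)

    # A rename across filesystems is not atomic, so refuse to proceed
    # if the two directories live on different devices.
    if os.stat(staging_dir).st_dev != os.stat(watched_dir).st_dev:
        raise ValueError("staging and watched directories must share a filesystem")

    staged = staging_dir / src.name
    shutil.copyfile(src, staged)  # slow step; the file is not yet visible to the reader

    final = watched_dir / src.name
    os.replace(staged, final)  # atomic rename, the equivalent of mv on one filesystem
    return final
```

When the files arrive via scp, scp itself plays the role of the copy step by writing into the staging directory, and only the final `os.replace` (or `mv`) is needed.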


That explains everything, thank you. Thanks also for highlighting the scp copy issue; we will use the mv command as suggested. Does StreamSets have any plans for a standalone cluster like NiFi's, one that does not depend on a Hadoop or Spark cluster?

LiveMore ( 2018-04-06 02:24:15 -0500 )

Yes, check out StreamSets Control Hub.

tmcgrath ( 2018-04-09 16:48:08 -0500 )

Understood now, thanks.

LiveMore ( 2018-04-10 02:43:04 -0500 )
