accepted inputs for StreamSets cluster mode

asked 2018-05-09 09:09:36 -0500

ddv


I am wondering about "fault tolerance" for StreamSets, particularly in cluster mode.

So far I have found no documentation page dedicated to this topic.

I then came across the page on "Cluster Batch and Streaming Execution Modes".

The way I understand it is the following:

  • StreamSets in cluster mode runs out of the box when using Kafka or a Hadoop FS as input

  • nothing is said about StreamSets in cluster mode with a regular file system as input (Origin=Directory)

  • so, as I understand it, cluster mode and Origin=Directory could be combined as follows:

(1) one worker (or Edge node, if I understand correctly) reads the input directory and publishes the filenames to Kafka,

(2) then we are back to the standard case (see the first point above), in which StreamSets in cluster mode reads from Kafka.
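To make the idea in (1) and (2) concrete, here is a minimal sketch in plain Python. It is not StreamSets code and does not use any StreamSets or Kafka API: the function names `publish_filenames` and `consume_filenames` are hypothetical, and an in-memory `queue.Queue` stands in for the Kafka topic. It only illustrates the pattern I have in mind, where a single reader lists the directory and publishes each file path once, and the cluster side then consumes those paths.

```python
import os
import queue
import tempfile


def publish_filenames(directory, publish, seen):
    """Step (1): publish the path of every file not published before.

    `publish` is any callable taking one path (assumption: in a real
    setup this would wrap a Kafka producer). `seen` is a set that the
    caller keeps between scans so that files are not published twice.
    """
    published = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            publish(path)
            seen.add(path)
            published.append(path)
    return published


def consume_filenames(topic):
    """Step (2): drain the paths currently waiting in the stand-in topic."""
    paths = []
    while True:
        try:
            paths.append(topic.get_nowait())
        except queue.Empty:
            return paths


if __name__ == "__main__":
    # Demo with a temporary directory and an in-memory queue as the "topic".
    demo_dir = tempfile.mkdtemp()
    for name in ("part-1.csv", "part-2.csv"):
        open(os.path.join(demo_dir, name), "w").close()

    topic = queue.Queue()
    seen = set()
    publish_filenames(demo_dir, topic.put, seen)
    for path in consume_filenames(topic):
        print("worker would process:", path)
```

Note that in this sketch the `seen` set lives only in the memory of the single reader, so it would be lost if that node crashed. That is exactly the kind of situation question (b) below is about.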

So, my questions are:

(a) Does StreamSets in cluster mode run out of the box with Origin=Directory (for a regular FS), or does it require the approach I described just above, or some other one?

(b) Is there a documentation page that explains how fault tolerance works in StreamSets, and how it is ensured that no data is lost on a crashed node?

