How can we set up an SDC multi-node cluster?
I went through all the cluster-related information in the documentation, in posts on this forum, and on Stack Overflow, but nowhere was it clear whether setting up a StreamSets cluster is possible.
A couple of questions:
- Does SDC need an external Hadoop cluster to launch MapReduce jobs [assume we are not using any Hadoop distribution]?
- Does SDC need an external Spark cluster to launch streaming jobs [assume we are not using any Hadoop distribution]?
- How does it work for this use case? We receive about 1,000 files a day, in parallel, from different upstream systems [via scp]. The files are relatively big, say 1 GB to 10 GB each, and we have to apply some transformations to all of them. The later joining and aggregation we are handling outside SDC. Does this require a single bigger machine with many cores and huge memory? Or is it possible to set up a cluster of SDC instances, the way NiFi does?
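For context on the sizing question, here is a quick back-of-the-envelope calculation from the numbers above. The sustained-rate figure assumes the files arrive spread evenly across a 24-hour window, which is my own assumption, not something stated in the workload description:

```python
# Back-of-the-envelope sizing for the use case above:
# ~1,000 files/day, each between 1 GB and 10 GB.

files_per_day = 1000
min_file_gb, max_file_gb = 1, 10

# Total daily volume (using 1 TB = 1000 GB for simplicity).
min_daily_tb = files_per_day * min_file_gb / 1000   # 1 TB/day
max_daily_tb = files_per_day * max_file_gb / 1000   # 10 TB/day

# Sustained ingest rate IF arrivals are spread over 24 hours (assumption);
# bursty arrivals would need correspondingly higher peak throughput.
seconds_per_day = 24 * 60 * 60
min_rate_mb_s = files_per_day * min_file_gb * 1024 / seconds_per_day
max_rate_mb_s = files_per_day * max_file_gb * 1024 / seconds_per_day

print(f"Daily volume: {min_daily_tb:.0f}-{max_daily_tb:.0f} TB")
print(f"Sustained rate: {min_rate_mb_s:.0f}-{max_rate_mb_s:.0f} MB/s")
```

So even at the low end this is on the order of a terabyte a day, which is part of why I am asking whether a single node is realistic here.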
- In case we run SDC on a single node for the above case, what happens if that node crashes for some reason?