First, for next time, it may be better to split this into multiple questions (as it is actually five different ones). I will take a crack at those I know.

1) Sqoop jobs themselves cannot be monitored by StreamSets: because they run as MapReduce jobs, they are monitored by your Hadoop cluster's own management system. However, we do have a tool that can build an equivalent StreamSets Data Collector (SDC) pipeline from a Sqoop command; at that point you have a normal SDC pipeline that you can run and manage instead of the Sqoop job. For more information on that functionality, see here.

2) Yes, each pipeline can support only one origin. For more background on why that is the case, see this question and answer. With regard to scaling, many individual origins in StreamSets Data Collector support multithreading out of the box (e.g. HTTP, UDP, the JDBC Multitable Consumer, the Kafka Multitopic Consumer, the Directory origin). For origins that don't support this, you can clone the pipeline and run many instances across different SDC instances. Or, you can use StreamSets Control Hub (SCH) to manage these in a more holistic and straightforward way.
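To make the scaling model concrete, here is a minimal Python sketch of the idea (not SDC's actual API): a multithreaded origin divides the work by partition internally, which is equivalent to running N clones of a single-threaded pipeline, each pinned to one partition. The partition data and the `run_pipeline_instance` function are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record source: four "partitions" standing in for, e.g.,
# table shards or Kafka partitions that a multithreaded origin divides up.
PARTITIONS = {
    0: ["a", "b"],
    1: ["c"],
    2: ["d", "e", "f"],
    3: [],
}

def run_pipeline_instance(partition_id):
    """One worker plays the role of one origin thread (or one cloned
    single-threaded pipeline assigned to one partition)."""
    processed = [record.upper() for record in PARTITIONS[partition_id]]
    return partition_id, processed

# A multithreaded origin does this fan-out internally; with a
# single-threaded origin you get the same effect by running N clones.
with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
    results = dict(pool.map(run_pipeline_instance, PARTITIONS))

print(results)
```

Either way, the key design point is the same: parallelism comes from partitioning the input, so no two workers contend for the same records.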

5) The JMS consumer is currently single-threaded (comment on or watch this Jira to track multithreading support). That means you will need multiple pipeline instances to consume in parallel (see above). For the pipeline reading from Kafka, however, you can configure multiple threads if you use the Kafka Multitopic Consumer origin. For the intermediate processing in that pipeline, you can use any of the processors already available (including invoking existing Spark jobs, or scripting in Jython, JavaScript, or Groovy), or write your own processor if those are insufficient. Then connect the Elasticsearch destination and configure the batch size and other parameters as desired.
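The consume, process, and batched-write flow above can be sketched in plain Python (a single consumer thread for simplicity; this is an illustration of the pattern, not SDC or the Elasticsearch client API, and all names here are hypothetical):

```python
import queue
import threading

BATCH_SIZE = 3  # analogous to the batch size configured on the destination

# Stand-in for the message broker: seven queued records plus a sentinel.
incoming = queue.Queue()
for i in range(7):
    incoming.put({"id": i})
incoming.put(None)  # sentinel: no more messages

batches_written = []  # stands in for bulk writes to Elasticsearch

def consume_and_batch():
    """One consumer thread: drain the queue, transform each record
    (the 'intermediate processing' stage), and flush fixed-size batches."""
    batch = []
    while True:
        record = incoming.get()
        if record is None:
            break
        record["indexed"] = True  # trivial stand-in for real processing
        batch.append(record)
        if len(batch) == BATCH_SIZE:
            batches_written.append(batch)
            batch = []
    if batch:  # flush the final partial batch
        batches_written.append(batch)

worker = threading.Thread(target=consume_and_batch)
worker.start()
worker.join()

print([len(b) for b in batches_written])  # → [3, 3, 1]
```

Batching the writes is what makes the destination side efficient: one bulk request per batch instead of one request per record, at the cost of a little latency for the final partial batch.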