
How to use StreamSets for large projects?

asked 2018-02-13 05:14:16 -0600 by ckayay

updated 2018-02-13 20:53:23 -0600 by jeff

I have some questions about StreamSets, as I am a newbie.

  1. In an enterprise where there are databases and SQL statements are used to pull the data with Sqoop, how can a StreamSets data ingestion pipeline be created? And how will Sqoop be monitored using StreamSets?
  2. If one needs to build a data pipeline by listening to a traditional message broker (JMS source), pulling data from databases (Sqoop), and tailing files, do several data pipelines need to be created, one for each data source? How will a StreamSets data pipeline scale itself? And how might this scaling work when SDC uses its own HTTP/TCP listeners to receive data?
  3. When deploying StreamSets, I assume we need to identify specific nodes in HA that can scale with additional nodes/instances just to ingest data into Kafka from the data sources? Then a YARN cluster setup running on the Spark/Hadoop nodes can be used for processing the data from Kafka into several sinks (Hadoop/Elasticsearch) through Spark?
  4. How can the Aggregator maintain state across several pipelines when instances are scaled out?
  5. How could I implement the following scenario in an enterprise environment that supports failover, parallelism, etc.: consume data from JMS queues and, for each queue, sink to a Kafka topic (ingestion pipeline); then read the Kafka topics in parallel, run some business logic on each XML document read (in Java or Spark, which can scale the processing), convert it to JSON, again in parallel, and sink each document to Elasticsearch. (From each Kafka partition some batch amount of data will be consumed, and I need to process each batch and write it individually to Elasticsearch in a multithreaded fashion, as we have seen a big performance bottleneck using the Elasticsearch Bulk API to write in batch.)
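The final stage of question 5 can be sketched in plain Python. This is a minimal, hypothetical illustration (not StreamSets code): one batch per Kafka partition is written to Elasticsearch on its own thread instead of through one large bulk request. The `bulk_index` function is a stand-in for a real Elasticsearch bulk call so the sketch runs without a cluster; all names are illustrative.

```python
# Hypothetical sketch: write each Kafka-partition batch to Elasticsearch
# on its own thread. bulk_index is a stub standing in for a real bulk call.
from concurrent.futures import ThreadPoolExecutor


def bulk_index(batch):
    """Stand-in for an Elasticsearch bulk request (e.g. the _bulk endpoint).

    Returns the number of documents "indexed" so the sketch is runnable
    without a live cluster.
    """
    return len(batch)


def write_batches_parallel(batches, max_workers=4):
    """Write each batch on its own thread, so one slow batch no longer
    stalls the others -- the point of the multithreaded approach."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, one result per batch
        return list(pool.map(bulk_index, batches))


# One batch per Kafka partition, for example:
batches = [[{"id": 1}, {"id": 2}], [{"id": 3}], [{"id": 4}, {"id": 5}, {"id": 6}]]
counts = write_batches_parallel(batches)
print(counts)  # [2, 1, 3]
```

Whether this outperforms a single bulk request depends on the cluster and batch sizes; the sketch only shows the threading shape being asked about.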

2 Answers


answered 2018-02-13 21:05:35 -0600 by jeff

First, for next time, it may be better to split this into multiple questions (as it is actually five different ones). I will take a crack at those I know.

1) Sqoop jobs themselves, being MapReduce jobs, cannot be monitored by StreamSets. That's because, like all MapReduce jobs, they are monitored on your particular Hadoop cluster's management system. However, we do have a tool that can build an equivalent StreamSets Data Collector pipeline from a Sqoop command, and at that point, you have a normal SDC pipeline that you can run and manage instead of the Sqoop job. For more information on that functionality, see here.

2) Yes, each pipeline can only support one distinct origin. For more background on why that is the case, see this question and answer. With regard to scaling, many individual origins in StreamSets Data Collector support multithreading out of the box (e.g. HTTP, UDP, the JDBC Multitable source, the Kafka multi-topic consumer, the local directory origin). For those origins that don't support this, you can clone the pipeline and run many instances on different SDC instances. Or, you can use StreamSets Control Hub (SCH) to manage these in a more holistic and straightforward way.

5) The JMS consumer is currently single-threaded (comment on/watch this Jira to add multithreading support). That means you will need multiple pipeline instances to consume in parallel (see above). For the pipeline reading from Kafka, however, you can configure multiple threads if using the Kafka multi-topic consumer origin. For the intermediate processing in that pipeline, you can use a variety of processors already available (including invoking existing Spark jobs or scripting in Jython, JavaScript, or Groovy), or write your own processor if those are insufficient. Then you will connect the Elasticsearch destination and configure batch size and other parameters as desired.
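The XML-to-JSON step mentioned above can be illustrated standalone. This is a minimal sketch, not SDC code: inside a pipeline this logic would live in a scripting processor (e.g. the Jython evaluator) or a Spark job, and the field names here are illustrative.

```python
# Minimal standalone sketch of the XML-to-JSON conversion step.
# In SDC this transform would run inside a scripting processor or Spark job.
import json
import xml.etree.ElementTree as ET


def xml_to_json(xml_doc):
    """Flatten a simple one-level XML document into a JSON string."""
    root = ET.fromstring(xml_doc)
    # One key per child element; nested structures would need more handling.
    record = {child.tag: child.text for child in root}
    return json.dumps(record, sort_keys=True)


doc = "<order><id>42</id><status>shipped</status></order>"
print(xml_to_json(doc))  # {"id": "42", "status": "shipped"}
```

Real documents with nesting, attributes, or namespaces would need a richer mapping, but the per-record shape is the same: parse, transform, serialize, then hand the JSON to the Elasticsearch destination.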


answered 2018-02-14 00:46:01 -0600 by ckayay

Thanks a lot for the answers. So my understanding is that if there are 20-30 JMS queues, Sqoop jobs, and files to ingest, there will be tens of ingestion pipelines, where each instance will need to run individually. For Kafka topics, each having several partitions, the same approach should be carried out. Will all these StreamSets instances then run on separate servers? Meaning: we have Spark and Hadoop nodes, and I assume we can use the same Spark nodes and install StreamSets Data Collector instances on them to run all these ingestion pipelines. And for processing, the same Spark nodes can be used...

The advantages of using StreamSets seem to be that:

  * It is easy to do the plumbing and make changes to the pipelines, as it is visual
  * You can see the performance and SLAs of the pipelines through performance metrics
  * You can develop a pipeline in a shorter period of time because of the existing tools
  * ???

