How to use StreamSets for large projects?

I have some questions about StreamSets, as I am a newbie:

  1. In an enterprise where there are databases and SQL statements are used to pull the data with Sqoop, how can a StreamSets data ingestion pipeline be created? And how would Sqoop be monitored using StreamSets?
  2. If one needs to build data pipelines by listening to a traditional message broker (JMS source), pulling data from databases (Sqoop), and tailing files, does a separate data pipeline have to be created for each data source? How does a StreamSets data pipeline scale itself, and how might this scaling work when SDC uses its own HTTP/TCP listeners to receive data?
  3. When deploying StreamSets, I assume we need to identify specific nodes in an HA setup that can scale with additional nodes/instances just to ingest data into Kafka from the data sources? Then a YARN cluster setup running on Spark/Hadoop nodes can be used for the data processing from Kafka into the various sinks (Hadoop/Elasticsearch) through Spark?
  4. How can the Aggregator maintain state across several pipelines when instances are scaled out?
  5. How could I implement the following scenario in an enterprise environment that supports failover, parallelism, etc.: consume data from JMS queues and, for each queue, sink to a Kafka topic (the ingestion pipeline); then read the Kafka topics in parallel, run some business logic (in Java or Spark, which can scale the processing) on each XML document read, convert it to JSON, and sink each document to Elasticsearch, again in a parallel fashion. (From each Kafka partition some batch of data will be consumed, and I need to process each batch and write it to Elasticsearch individually in a multithreaded fashion, as we have seen a big performance bottleneck using the Elasticsearch Bulk API to write in batch.) A rough sketch of what I have in mind is shown after this list.
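
To make the last item concrete, below is a rough, minimal sketch of the per-partition batch processing I have in mind, written with plain kafka-clients and Jackson's XmlMapper for the XML-to-JSON conversion. The broker address, topic name, consumer group, thread count, and the bulkIndex() helper are all placeholders made up for illustration; the real Elasticsearch write would go where bulkIndex() is called.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class XmlToJsonIndexer {

    private static final XmlMapper XML_MAPPER = new XmlMapper();
    private static final ObjectMapper JSON_MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "xml-to-es");                // placeholder consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        // One worker per partition batch, so each batch is converted and
        // indexed independently of the others.
        ExecutorService pool = Executors.newFixedThreadPool(4); // placeholder thread count

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("xml-docs")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                List<Future<?>> inFlight = new ArrayList<>();
                // Split the poll result by partition and hand each partition's
                // batch to its own thread.
                for (TopicPartition partition : records.partitions()) {
                    List<ConsumerRecord<String, String>> batch = records.records(partition);
                    inFlight.add(pool.submit(() -> processBatch(batch)));
                }
                for (Future<?> f : inFlight) {
                    f.get(); // wait so the commit only covers processed records
                }
                consumer.commitSync();
            }
        }
    }

    private static void processBatch(List<ConsumerRecord<String, String>> batch) {
        try {
            for (ConsumerRecord<String, String> record : batch) {
                // Convert the XML payload to a JSON string with Jackson.
                String json = JSON_MAPPER.writeValueAsString(XML_MAPPER.readTree(record.value()));
                // ... business logic on the document would go here ...
                bulkIndex(json);
            }
        } catch (Exception e) {
            e.printStackTrace(); // real code would send failures to an error topic
        }
    }

    // Hypothetical stand-in for the Elasticsearch write; in practice this would
    // buffer the documents and issue one Bulk API request per batch.
    private static void bulkIndex(String jsonDocument) {
        System.out.println("would index: " + jsonDocument);
    }
}
```

The question is essentially whether this kind of per-partition parallelism and batching can be expressed (with failover and scaling out) inside StreamSets pipelines instead of hand-written consumer code like the above.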