I am exploring StreamSets and have a few questions:

  1. First and foremost, how do I deploy a StreamSets data pipeline in a production environment, and what is the best practice? Does the open source StreamSets Data Collector have an option to deploy on Kubernetes? I have a 5-node Greenplum cluster; can I run StreamSets in cluster mode on the same nodes?
  2. My main use case is to get data from ActiveMQ and MySQL and ingest it into Greenplum. Does StreamSets support this?
  3. What is the best way to generate summary table data through StreamSets Data Collector?
  4. How do I monitor StreamSets jobs, and how do I set up a retry mechanism, failover, etc.?

  1. StreamSets Control Hub includes Kubernetes integration. If you are just using Data Collector, you will need to deploy manually or build the automation yourself.
  2. Yes - you can use the JMS Consumer origin and JDBC Query Consumer or JDBC Multitable Consumer origins to read data, then the JDBC Producer destination to write to Greenplum.
  3. What are you looking for here? Something like Data Delivery Reports?
  4. Check out the docs on Pipeline Monitoring.
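
On the retry part of question 4: if you end up scripting around Data Collector yourself (for example, a wrapper that checks on a pipeline and reacts to failures), a generic retry with exponential backoff could look like the sketch below. This is plain Python, not a StreamSets feature or API; `retry_with_backoff` and `flaky_check` are hypothetical names made up for illustration, and the delay values are arbitrary assumptions.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `max_attempts` is used up.

    The wait doubles after each failure: base_delay, 2*base_delay, ...
    The last failure is re-raised so the caller can alert or fail over.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical stand-in for "check the pipeline": fails twice, then succeeds.
calls = {"count": 0}


def flaky_check() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "RUNNING"


# Record the delays instead of really sleeping, so the demo runs instantly.
delays: List[float] = []
status = retry_with_backoff(flaky_check, max_attempts=5, base_delay=1.0,
                            sleep=delays.append)
print(status, delays)
```

In a real script you would replace `flaky_check` with whatever check you actually perform and leave `sleep` at its default.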
Thanks for your response. I have a really high data load, around 1-2 TB a day. Can SDC handle this with all the custom business rules (cleansing, transformation, etc.), and if so, what is the best approach?

@ankitbeohar90 That's probably more than can fit into Ask StreamSets, so please drop me an email. If you include details of your company and location, I can connect you to the right person to help.

