How to set up StreamSets in HA mode on AWS EC2 instances, distributed across all participating nodes of the cluster?

asked 2018-06-26 02:45:37 -0500

I need to set up a pipeline using StreamSets Data Collector (SDC) that streams data from a source S3 bucket to a destination S3 bucket. I have successfully installed StreamSets on my local machine in standalone execution mode, but I would like to know how to set up Data Collector on an AWS EC2 cluster (multiple nodes) for production use, so that it can run in distributed mode.

Are there any specific guidelines?

Let me briefly explain my use case:

  1. There could be anywhere from 100 to 10,000 pipelines, all running simultaneously. If SDC is installed on a single machine, it will not scale, so we need to deploy it on a cluster of nodes.
  2. How does SDC distribute and persist pipeline metadata, offsets, and cron expression information across the cluster?
  3. In short, we need to scale SDC out across multiple nodes so that the SDC service is highly available at all times. How can this be done?