Need to deploy SDC on an AWS EC2 cluster and ensure HA for all pipelines

asked 2018-07-02 01:42:05 -0500

My requirement is to move files between AWS S3 buckets, i.e.

Source: an AWS S3 bucket
Destination: another AWS S3 bucket

Note: our use case could involve anywhere from 100 to 5000 pipelines, so they need to be distributed across a cluster, and HA is a must.

To do so, I planned to deploy it on a cluster of AWS EC2 nodes, since I believed that in cluster mode a pipeline could run distributed across all cluster nodes in a fault-tolerant manner, which would also provide high availability.

Installation steps:

I set up a MapR cluster (version 6.0.1) and installed StreamSets (version 3.3.0) on top of it, but when I tried to create a pipeline with an AWS S3 source and destination, it threw the error below.

Error - VALIDATION_0071 - Stage ‘Amazon S3’ from ‘Amazon Web Services 1.11.123’ library does not support ‘Cluster Yarn Streaming’ execution mode.

My question: is there any way to deploy StreamSets Data Collector over a cluster in distributed mode while ensuring HA?

1 Answer

answered 2018-07-02 10:32:38 -0500

metadaddy

Cluster Streaming mode supports only Kafka and MapR Streams origins, so a pipeline with an Amazon S3 origin fails validation in that mode. See the documentation.
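To make the restriction concrete, here is a minimal sketch of the kind of check that produces an error like VALIDATION_0071. This is not StreamSets code: the table, the function name `validate_origin`, and the message format are all assumptions made for illustration, and only the Cluster Yarn Streaming entry reflects what is stated above.

```python
# Hypothetical compatibility table, NOT Data Collector internals.
# Only the cluster streaming entry is taken from the answer above.
SUPPORTED_ORIGINS = {
    "Cluster Yarn Streaming": {"Kafka Consumer", "MapR Streams Consumer"},
}


def validate_origin(origin, execution_mode):
    """Return None if the origin is allowed, otherwise an error message."""
    allowed = SUPPORTED_ORIGINS.get(execution_mode)
    if allowed is None:
        # This sketch makes no claim about modes it does not list.
        return None
    if origin in allowed:
        return None
    return f"Stage '{origin}' does not support '{execution_mode}' execution mode."


# The S3 origin is rejected, the Kafka origin is accepted.
print(validate_origin("Amazon S3", "Cluster Yarn Streaming"))
print(validate_origin("Kafka Consumer", "Cluster Yarn Streaming"))
```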

You should look into StreamSets Control Hub for managing many pipelines.
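For intuition only, the idea of spreading many pipelines over several Data Collector instances and moving them when one instance fails can be sketched as below. This does not use or describe the Control Hub API; the function names, node names (`sdc-1` and so on), and pipeline names are hypothetical, and the round-robin and least-loaded rules are just one simple choice.

```python
def assign_pipelines(pipelines, nodes):
    """Spread pipelines over nodes round-robin; returns {node: [pipelines]}."""
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = {node: [] for node in nodes}
    for i, pipeline in enumerate(pipelines):
        assignment[nodes[i % len(nodes)]].append(pipeline)
    return assignment


def fail_over(assignment, failed_node):
    """Return a new assignment with the failed node's pipelines moved
    one by one to whichever surviving node currently has the fewest."""
    survivors = {n: list(ps) for n, ps in assignment.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over pipelines")
    for pipeline in assignment.get(failed_node, []):
        target = min(survivors, key=lambda n: len(survivors[n]))
        survivors[target].append(pipeline)
    return survivors


# Hypothetical example: 100 S3-to-S3 pipelines on three instances.
pipelines = [f"s3-copy-{i}" for i in range(100)]
nodes = ["sdc-1", "sdc-2", "sdc-3"]
plan = assign_pipelines(pipelines, nodes)
recovered = fail_over(plan, "sdc-2")
print({node: len(ps) for node, ps in recovered.items()})
```

In this sketch every pipeline stays assigned after a node is lost, which is the property the question calls HA; a real deployment would also need health checks and a way to restart each pipeline on its new node.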

Seen: 1,068 times

Last updated: Jul 02 '18