
How to configure a workflow of pipelines in SDC?

asked 2019-09-18 11:28:54 -0500 by Torkia Boussada, updated 2019-09-18 14:08:35 -0500 by metadaddy

Hello,

We have built around 250 pipelines in SDC.

As there are data dependencies between these pipelines, we have a job that kicks off the first pipeline, which in turn calls the next pipelines, and so on. An example of a dependency: one pipeline pushes data into a Hive table that is used by the next pipeline.

We are using the HTTP Client stage to call the next pipeline from the no-more-data event on the origin stage. Pipeline IDs are stored in a Hive table (stage.pipelines). A sample pipeline is attached.
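In rough terms, the HTTP Client stage issues a start call along the lines of the sketch below; the host, credentials, pipeline ID, and parameter names are placeholders rather than our real values, and the /rest/v1 path assumes Data Collector's default REST API:

    # Rough equivalent of the start call made by the HTTP Client stage.
    # Host, credentials, pipeline ID, and parameter names are placeholders.
    import requests

    SDC_URL = "http://sdc-host:18630"      # placeholder Data Collector URL
    PIPELINE_ID = "nextPipelineId"         # looked up from stage.pipelines in our case

    resp = requests.post(
        f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/start",
        json={"PARAM_TABLE": "stage.orders"},  # runtime parameters, placeholder name/value
        headers={"X-Requested-By": "sdc"},     # header SDC expects on POST calls
        auth=("admin", "admin"),               # placeholder credentials
    )
    resp.raise_for_status()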

When testing the workflow we noticed that the next pipeline is kicked off as soon as the current pipeline finishes reading data from the origin, i.e. before it finishes writing data to the Hive table. This defect is causing data loss.

My question:

  1. Is there a better way to set up a workflow of pipelines in StreamSets Data Collector? We need the next pipeline to start only when the current one has finished; all data should be transferred to the EDH before the next one is called.
  2. We don't want to hardcode the pipeline ID in the call URL; we look it up from the pipeline title in a Hive table.
  3. We need to be able to pass parameters to the next pipeline.
  4. We don't want the next pipeline to kick off when the current one fails or is stopped.

Please let me know if you need more information.

Regards,

Torkia


1 Answer


answered 2019-09-18 14:05:20 -0500 by metadaddy

StreamSets Data Collector 3.11.0, due for release in early October, will include a new 'Orchestrator' stage library containing five new stages:

  • Cron Scheduler origin - Generates a record with the current datetime based on a Cron expression
  • Start Pipeline origin & processor - Starts a Data Collector, Transformer or Edge pipeline
  • Start Job - Starts a Control Hub job
  • Control Hub API processor - Calls Control Hub APIs

The Start Pipeline and Start Job stages can be configured to wait for the pipeline/job to finish before passing the record along the pipeline, or run the pipeline/job in the background, passing the record along as soon as the start instruction is sent. They can also pass runtime parameters into the pipeline/job.

These new stages will enable use cases such as:

  • Chaining Pipelines [screenshot]
  • Control Hub Jobs [screenshot]

At the time of writing, the Orchestrator library is available for evaluation in the nightly build of Data Collector.

I think the Orchestrator library should help with your use case.


Comments

Hello Pat, thanks for your time and all the information you have provided here. In my case, I think the Orchestrator library would help. I have another question: if the pipeline chain fails at any step, is there an option to resume from the failed one? Thanks again.

Torkia Boussada (2019-09-18 19:58:09 -0500)

Currently, no; you can check the status of the pipeline and use the Stream Selector processor to do the if/else.
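Outside the pipeline, the same if/else can be scripted against the Data Collector REST API. A minimal sketch, assuming the default /rest/v1 endpoints, placeholder host/credentials/pipeline IDs, and status names you should verify against your SDC version:

    # Poll the current pipeline's status and only start the next pipeline
    # when it finished cleanly. Host, credentials, and IDs are placeholders.
    import time
    import requests

    SDC_URL = "http://sdc-host:18630"
    AUTH = ("admin", "admin")

    def wait_for_pipeline(pipeline_id, poll_seconds=10):
        """Poll until the pipeline reaches a terminal state and return it."""
        terminal = {"FINISHED", "RUN_ERROR", "STOPPED", "START_ERROR"}
        while True:
            resp = requests.get(f"{SDC_URL}/rest/v1/pipeline/{pipeline_id}/status", auth=AUTH)
            resp.raise_for_status()
            status = resp.json().get("status")
            if status in terminal:
                return status
            time.sleep(poll_seconds)

    # Mirrors the Stream Selector if/else: chain only on a clean finish.
    if wait_for_pipeline("currentPipelineId") == "FINISHED":
        requests.post(
            f"{SDC_URL}/rest/v1/pipeline/nextPipelineId/start",
            headers={"X-Requested-By": "sdc"},
            auth=AUTH,
        )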

metadaddy (2019-09-18 21:59:18 -0500)