
Why can each pipeline only contain one origin?

asked 2017-12-13 02:40:13 -0600

Vivian Y

updated 2017-12-13 11:07:00 -0600

metadaddy

Is there a reason or technical limitation why each pipeline in StreamSets can contain only one origin instead of multiple origins? For example, could origin 1 be merged with origin 2 into one data stream and processed in the same pipeline?


2 Answers


answered 2017-12-13 11:16:06 -0600

jeff

updated 2017-12-13 11:17:34 -0600

The fundamental problem is in the "merge" process you allude to. Presumably, combining data from multiple origins would require some kind of matching on particular "key" field(s). To support such an operation, SDC would need to be able to access all parsed records from all origins for an indefinite period of time (in case newer records need to "join" to much older ones from another origin). That leads to significant complexities and questions without clear answers.

How would SDC retain all these records? If it were to keep everything in memory, the memory cost would grow without bound as pipelines continue running. If there were some mechanism to store records in persistent storage to support the joins, then SDC would basically have to either implement its own database (along with all the desirable properties around reliability/failover/availability) or require users to provide their own. In either case, it would greatly increase the complexity of the codebase and impose infrastructure requirements that many users would find intolerable.
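To make the retention problem concrete, here is a minimal Python sketch of a symmetric hash join over two record streams. It is not SDC code; the function name, the record shape (plain dicts), and the key field are all assumptions made for illustration. The point is only that every record has to stay buffered, because a match from the other stream might still arrive later.

```python
from collections import defaultdict
from itertools import zip_longest


def symmetric_hash_join(left, right, key):
    """Join two record streams on `key` (hypothetical helper, not an SDC API).

    Returns the joined records and the number of records still buffered.
    Nothing is ever evicted, since a matching record from the other
    stream could show up at any later point.
    """
    left_buf = defaultdict(list)
    right_buf = defaultdict(list)
    joined = []

    # Alternate between the streams to mimic records arriving over time.
    for l_rec, r_rec in zip_longest(left, right):
        if l_rec is not None:
            left_buf[l_rec[key]].append(l_rec)
            joined.extend({**l_rec, **r} for r in right_buf.get(l_rec[key], []))
        if r_rec is not None:
            right_buf[r_rec[key]].append(r_rec)
            joined.extend({**l, **r_rec} for l in left_buf.get(r_rec[key], []))

    buffered = sum(map(len, left_buf.values())) + sum(map(len, right_buf.values()))
    return joined, buffered
```

In this sketch the buffered count always equals the total number of records read from both streams, so on a long-running pipeline the state keeps growing for as long as data arrives.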

There's also the question of how to handle failure recovery. What should happen if a multi-origin pipeline fails? Would it need to start over from the beginning of each origin's file/database table/etc.? How else could we be sure that we capture the necessary records to perform joins? Furthermore, if we parse a record from an origin that we expect to join to another origin's records, how can we know whether that second record doesn't exist or simply hasn't been read yet? To account for the latter possibility, we would basically have to pause the entire pipeline (producing no merged records) until every origin is completely read.
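The "doesn't exist or hasn't been read yet" problem can be sketched in Python as a blocking hash join. This is an illustration under assumed names and record shapes, not how SDC works: the second origin is read to the end before a single result can be produced, because only then is it safe to say that a record has no match.

```python
def left_outer_join(left, right, key):
    """Yield (left_record, right_record_or_None) pairs (hypothetical helper).

    The right origin is consumed completely before anything is emitted.
    Only after it is exhausted can an unmatched left record be reported
    as "no match" rather than "match not read yet".
    """
    right_index = {}
    for rec in right:  # blocks until the right origin ends
        right_index[rec[key]] = rec

    for rec in left:
        yield rec, right_index.get(rec[key])
```

If `right` is an origin that never ends (a message queue or a tailed log, say), the first loop never finishes and the pipeline produces no output at all.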

It's possible that we can come up with answers to these challenges, most likely with significant caveats (to allow for a reasonable implementation). It's also possible that there is an entirely different paradigm to approach this. In either case, I hope this answers the question as to "why not" in the current state of the application.



I arrived at this page because I had the same question. I understand your reasoning, but I believe it overlooks one case, which happens to be my use case: I want to UNION two data sets from two different sources, that is, enrich a data set with additional rows rather than additional fields.

Andelu (2019-02-27 18:46:50 -0600)
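For contrast with a keyed join, a union in the sense of this comment needs no matching and no long-lived buffer. A rough Python sketch, assuming in-memory iterables stand in for origins (the function name is made up for illustration and is not an SDC or Transformer API):

```python
from itertools import zip_longest

_DONE = object()  # marks an origin that has run out of records


def union_streams(*origins):
    """Interleave records from several origins without matching on any key."""
    for batch in zip_longest(*origins, fillvalue=_DONE):
        for rec in batch:
            if rec is not _DONE:
                yield rec
```

Each record is passed through as soon as it is read, so the per-record state is constant. The failure-recovery question from the answer (tracking an offset per origin) would still apply.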

You might want to look into StreamSets' new product, Transformer, which allows multiple origins in a pipeline and provides built-in processors such as Union and Join --

iamontheinet (2020-06-11 18:41:55 -0600)

answered 2018-06-05 13:08:43 -0600

bob

Depending on your definition of "merge", it could be that you want to read a record and add fields to it from another data source. Rather than "merging", this might be considered "enriching": adding information from another data source to each incoming record. In this case, you can use a lookup processor, such as the JDBC Lookup processor, to enrich the record stream with the additional information.
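As a rough illustration of the lookup idea (not the JDBC Lookup processor itself), the Python sketch below enriches each incoming record with one query against a reference table. The in-memory SQLite database, the `customers` table, and the field names are all assumptions chosen to keep the example self-contained.

```python
import sqlite3


def enrich(records, conn, key):
    """Add a `name` field to each record via a per-record lookup."""
    for rec in records:
        row = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (rec[key],)
        ).fetchone()
        # Records without a match pass through with name set to None.
        yield {**rec, "name": row[0] if row else None}


# Hypothetical reference data standing in for the external data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
```

Because the second data source is queried on demand, the pipeline still has a single origin and does not need to buffer records from the lookup side.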

