
How do downstream stages throttle earlier stages?

asked 2019-02-10 00:43:20 -0600 by rleyba


This is a general question about StreamSets pipelines. If a pipeline has a large number of stages (e.g. 30, 40, 50 or more) and a stage downstream in the pipeline slows down, how does it signal the upstream stages to slow down as well, so that the whole pipeline stays in sync?

To illustrate, I asked a question on this forum recently; the sample pipeline is below.

[screenshot of the sample pipeline]

If the final output stage (e.g. the write to Local FS) slows down considerably, will the intermediate stages upstream of it (e.g. the write-session-to-database stage), or even the input stage, also slow down?

What if the input were a File Tail origin continuously tailing a high-volume log file, the post-processing were CPU-intensive, and the output were a write to a filesystem across the WAN? How would the pipeline handle that?

Thanks for any insights.


1 Answer


answered 2019-02-11 16:35:26 -0600 by jeff

Yes, the origin will "slow down" according to how fast the rest of the pipeline processes the batches. This is true whether the origin is single-threaded (in which case that thread waits for the batch complete/acknowledgement from all terminal stages before proceeding with the next read) or multi-threaded (where the same thing happens on a per-thread basis).
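To make the mechanism concrete, here is a minimal, hypothetical Python sketch (a toy simulation, not StreamSets code): because the loop is synchronous, the origin cannot read the next batch until the terminal stage has acknowledged the current one, so the origin's effective rate is capped by the slowest stage in the pipeline.

```python
import time

def origin_read(batch_num):
    # Simulated origin: produces one batch of records.
    return [f"record-{batch_num}-{i}" for i in range(5)]

def slow_destination(batch):
    # Simulated slow terminal stage (e.g. a write across a WAN).
    time.sleep(0.01)
    return len(batch)

def run_pipeline(num_batches):
    written = 0
    for n in range(num_batches):
        batch = origin_read(n)              # origin blocks here...
        written += slow_destination(batch)  # ...until the terminal stage finishes the batch
    return written

print(run_pipeline(3))  # → 15; throughput is governed by slow_destination, not origin_read
```

The same backpressure falls out of the structure: there is no queue between stages, so a slow destination automatically delays the next origin read.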

For some origins, this isn't a problem. For example, a File Tail origin will simply hold its read position and wait until it's time to run again. But for other origins, it can mean messages are not picked up (e.g. a UDP or TCP Server origin, where live network connections end up being refused from the clients' side). In those cases, you will need to provision more processing capacity by running multiple pipelines in parallel, with a load balancer in front.



Hi Jeff, thanks for this comprehensive explanation. It is good to know that, for the most part, pipelines will be "self-limiting". This also highlights the importance of (in my case) using Apache Kafka to "buffer" any input/output flow-rate mismatch and level off peaks and valleys in the flow.

rleyba (2019-02-12 06:09:47 -0600)
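As a hypothetical illustration of that buffering idea (a toy Python bounded queue standing in for a Kafka topic, not Kafka itself): the producer and consumer run at independent rates, the buffer absorbs short bursts, and the producer blocks only when the buffer is actually full.

```python
import queue
import threading

# Bounded buffer: plays the role of a Kafka topic decoupling producer and consumer rates.
buf = queue.Queue(maxsize=100)

def producer(n):
    for i in range(n):
        buf.put(i)   # blocks only if the buffer is full (backpressure on the producer)
    buf.put(None)    # sentinel: no more records

def consumer(results):
    while True:
        item = buf.get()
        if item is None:
            break
        results.append(item)

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(10)
t.join()
print(len(results))  # → 10; all records delivered despite independent producer/consumer pacing
```

With a real Kafka topic the decoupling is even stronger, since records persist on disk and the consumer can lag far behind during a peak and catch up later.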
