How does StreamSets handle throttling large numbers of pipelines?

2018-08-08 11:00:14

dcwatson84

2018-08-08 13:18:04

metadaddy

If I had 100k pipelines that all needed to run once an hour, and I triggered them all at XX:00:00, how would StreamSets throttle the system when that load becomes too much? Obviously the CPU limits how many threads can run in parallel, but that level of throttling isn't enough to keep a system from crashing. If StreamSets actually attempted to execute all 100k, then regardless of CPU capabilities, memory could quickly be used up, which would cause heavy swapping and inefficient use of resources.

So the only options I can come up with are...

  1. It queues some.
  2. It fails some.
  3. It runs them all.

If it's #1 or #2, what logic determines when a pipeline is queued or failed? If it's #3, the implication is that even with relatively lightweight pipelines, StreamSets can bog down simply from kicking off too many at once. Is that right?

2018-08-08 13:17:52

metadaddy

At present, the answer is #3 - Data Collector does exactly what you tell it to do. It's up to you to schedule your pipelines appropriately.
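One way to "schedule appropriately" is to spread the starts across the hour instead of firing everything at XX:00:00. The sketch below only computes start offsets; it is not a StreamSets feature, and the function name, the batch size of 100, and the one-hour window are all illustrative assumptions. How each pipeline is actually started at its offset is left to whatever external scheduler you use.

```python
from datetime import timedelta


def staggered_offsets(pipeline_ids, window_seconds=3600, batch_size=100):
    """Map each pipeline ID to a start offset within the window.

    Hypothetical helper: pipelines are grouped into batches of at most
    `batch_size`, and the batches are spaced evenly across the window,
    so no more than `batch_size` pipelines start at the same moment.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ids = list(pipeline_ids)
    if not ids:
        return {}
    # Ceiling division: number of batches needed to hold every pipeline.
    n_batches = -(-len(ids) // batch_size)
    # Seconds between the start of one batch and the start of the next.
    step = window_seconds / n_batches
    return {
        pid: timedelta(seconds=(index // batch_size) * step)
        for index, pid in enumerate(ids)
    }
```

With 100k pipelines and the assumed defaults this yields 1,000 batches started 3.6 seconds apart, so the last batch still starts inside the hour. The right batch size depends on how heavy the pipelines are and has to be found by measurement.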

There is an open Jira to at least enable throttling of the pipeline starts. Please watch if interested:

jeff ( 2018-08-08 14:06:40 -0500 )

That's unfortunate. So given that the DC is easy to bog down, is there an API to monitor it so that we can prevent that? Even if I queued the work externally, I still need to be able to detect when the DC can handle more work.

dcwatson84 ( 2018-08-08 18:16:10 -0500 )

I think you can get pretty much every stat via a JMX request - http://hostname:18630/rest/v1/system/jmx

metadaddy ( 2018-08-08 18:28:12 -0500 )
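
For the "detect when the DC can handle more work" part, an external queue could poll the JMX endpoint mentioned above before releasing more pipelines. The following is a rough sketch, not a documented recipe. It assumes the endpoint returns JSON of the form `{"beans": [...]}` that includes the standard JVM bean `java.lang:type=Memory` with a `HeapMemoryUsage` entry; verify this against your own instance. Authentication is omitted, `hostname` is the placeholder from the thread, and the 0.8 threshold and all function names are made up for illustration.

```python
import json
import urllib.request

# URL as given in the thread; "hostname" is a placeholder.
JMX_URL = "http://hostname:18630/rest/v1/system/jmx"


def heap_fraction_used(jmx_payload):
    """Return used/max heap from a parsed JMX payload.

    Assumes the payload looks like {"beans": [{"name": ..., ...}, ...]}.
    """
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == "java.lang:type=Memory":
            heap = bean["HeapMemoryUsage"]
            if heap["max"] <= 0:
                # The JVM reports -1 when no maximum heap size is defined.
                raise ValueError("maximum heap size is undefined")
            return heap["used"] / heap["max"]
    raise LookupError("java.lang:type=Memory bean not found in payload")


def has_headroom(jmx_payload, max_heap_fraction=0.8):
    """True if heap usage is below the (assumed) threshold."""
    return heap_fraction_used(jmx_payload) < max_heap_fraction


def fetch_jmx(url=JMX_URL, timeout=10):
    """Fetch and parse the JMX document.

    No credentials are sent; a real Data Collector will likely need them.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

An external scheduler would call `has_headroom(fetch_jmx())` before starting the next batch and wait otherwise. Heap usage is only one signal; thread counts or pipeline-level metrics from the same JMX document may be better indicators, depending on the workload.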
Asked: 2018-08-08 11:00:14

Last updated: Aug 08 '18