Question about thread management in StreamSets Data Collector

asked 2018-02-13 10:46:10 -0600

updated 2018-02-13 20:49:21 -0600

We are using StreamSets to migrate data from a local file system to Azure Data Lake Store. The pipeline is very simple: a Directory origin followed by an Azure Data Lake Store destination.

On the origin, we increased parallelism by setting the number of threads to 10. As a consequence, SDC executes 10 runners.

While monitoring SDC from the SDC metrics portal, especially the Threads graph, we noticed that the number of live threads is constantly increasing. Looking at the thread dump, the majority of them are named "Data Lake Idle Close Thread". Moreover, it seems that SDC keeps each of these threads alive even after the file the thread is watching has been closed.

Is this behavior normal? I mean, is it normal that SDC keeps these threads alive even after the file is closed? Also, is there a way to kill these useless threads?

I'm asking because we've hit an "OutOfMemoryError: unable to create new native thread" error and could not find anything on the internet that solves this problem.

1 Answer

answered 2018-02-13 20:50:28 -0600

This sounds like it could be a bug in the Azure Data Lake destination. Can you please open a Jira and include the information here, as well as a thread dump showing the problem?
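
To make the thread dump easier to read in the ticket, it can help to summarize how many threads share each name. Below is a minimal sketch, assuming the dump was captured in the HotSpot text format (for example with jstack), where each thread entry begins with the thread name in double quotes. The function name count_threads_by_name is made up for this illustration and is not part of SDC.

```python
import re
from collections import Counter

# Assumption: HotSpot-style dump, where each thread entry starts with a
# header line such as:
#   "Data Lake Idle Close Thread" #42 daemon prio=5 os_prio=0 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def count_threads_by_name(dump_text):
    """Return a Counter mapping each thread name to how often it appears."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Running this on dumps taken a few minutes apart and comparing the count for "Data Lake Idle Close Thread" would show whether those threads keep accumulating.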
