
Question about thread management in StreamSets Data Collector

asked 2018-02-13 10:46:10 -0600

r.fortino

updated 2018-02-13 20:49:21 -0600

jeff

We are using StreamSets to migrate data from a local file system to Azure Data Lake Store. The pipeline is very simple: a Directory origin followed by an Azure Data Lake Store destination.

On the origin, we increased parallelism by setting the number of threads to 10. As a consequence, SDC executes 10 runners.

Monitoring SDC from its metrics page, especially the Threads graph, we noticed that the number of live threads is constantly increasing. Looking at the thread dump, the majority of them are named "Data Lake Idle Close Thread"; moreover, it seems that SDC keeps each of these threads alive even after the file it was listening on has been closed.

Is this behavior normal? That is, is it expected that SDC keeps these threads alive even after the file is closed? And is there a way to kill these useless threads?

I'm asking because we hit an "OutOfMemoryError: unable to create new native thread" error and could not find anything online that solves the problem.


1 Answer


answered 2018-02-13 20:50:28 -0600

jeff

This sounds like it could be a bug in the Azure Data Lake destination. Can you please open a Jira and include the information here, as well as a thread dump showing the problem?
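
For anyone collecting the thread dump, a small summary of thread names makes the growth easier to show in a ticket. The following is only a minimal sketch, not part of SDC: the class name ThreadDumpSummary and its countByName helper are hypothetical, and it assumes a HotSpot-style dump (for example from jstack) in which each thread header line starts with the quoted thread name. It groups threads by name, ignoring a trailing numeric suffix, and prints the counts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: summarizes a jstack-style thread dump read from stdin.
public class ThreadDumpSummary {

    // Counts threads by name. Assumes each thread header line begins with
    // the thread name in double quotes, as in HotSpot thread dumps.
    static Map<String, Integer> countByName(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (!line.startsWith("\"")) {
                continue; // stack frames and blank lines are skipped
            }
            int end = line.indexOf('"', 1);
            if (end < 0) {
                continue; // malformed header, no closing quote
            }
            // Drop a trailing numeric suffix so names like "pool-1-thread-7"
            // and "pool-1-thread-8" fall into the same group.
            String name = line.substring(1, end).replaceAll("[-\\s#]*\\d+$", "");
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        // Print the largest groups first.
        countByName(lines).entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue())
            .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```

Piping a dump of the SDC process into this class a few minutes apart and comparing the counts for "Data Lake Idle Close Thread" would show whether that group keeps growing while the others stay flat.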

