Reduce "Starting" duration for cluster batch mode

asked 2018-05-30 03:07:03 -0500

davidha

Hi, I have been using StreamSets for half a year. I have found that the "Starting" phase takes very long for cluster batch mode jobs, usually 2-3 minutes or more. That may not be noticeable for a huge data set, but I now have plenty of small data sets to transfer. The actual processing time is acceptable, while the 2-3 minute "Starting" time is annoying. Is there any way to eliminate this long wait?




Which version are you on? And which stages are involved in your pipeline?

Mufy ( 2018-05-30 03:16:23 -0500 )

I am using streamsets-datacollector- There are no processor stages involved, just a MapRFS origin writing to a MapRFS destination. Sometimes the wait goes up to 10 minutes, which is not justifiable. Is there any way to check where the run is stalling?

davidha ( 2018-05-31 04:11:30 -0500 )

I found a significant improvement after restarting SDC. Is that because running SDC continuously keeps caching something that is not useful in memory or in the JVM?

davidha ( 2018-05-31 21:18:02 -0500 )

I'd recommend moving to the latest version of SDC, as several enhancements have gone into improving JVM memory management and the like.

Mufy ( 2018-06-01 01:13:23 -0500 )

Thank you for the suggestion. We are running StreamSets in production, so we may not be able to upgrade easily. Is JVM memory management one of the major focuses of recent versions? I am thinking of a workaround: scheduling SDC to restart periodically. Any thoughts on that?

davidha ( 2018-06-01 03:21:21 -0500 )