
Load data from file system and process it efficiently

asked 2020-01-30 23:46:53 -0500

anonymous user


updated 2020-02-04 10:27:26 -0500

metadaddy
  1. How can I efficiently load a large data set, around 6 to 10 million records (500 MB to 1 GB), from the file system into StreamSets, process it, and save it to a database? It currently takes around 50 to 70 minutes, and the data size may grow in the future.
  2. Number of files: between 1 and 3.
  3. We need to reduce the time from 50 minutes to 15 minutes.

Below are my pipeline details.

  1. Pipeline general config.
    a) Execution Mode : Standalone
    b) Delivery Guarantee : At least once
  2. Data directory config.
    a) Number of Threads: 1
    b) Batch Size (recs): 1000
    c) Batch Wait Time (secs): 60
    d) Max Files Soft Limit: 1000
    e) Spooling Period (secs): 5
    f) Buffer Limit (KB): 128
  3. JavaScript evaluator.
    a) We load some parametric files and store their contents in state variables, which we then use to filter the records. (The parametric files contain the reference data we filter on, for example the list of valid cost centers.)
    b) We load the parametric files only when a new data file arrives.
  4. The filtered data is saved to multiple databases.

One more issue: we are running around 10 pipelines on a single VM. These pipelines process less data, around 1 million records, but the VM's CPU utilization still reaches 700%. VM config: 8 cores, 64 GB RAM. Do you have any suggestions on this?

Please suggest what approach we should take to improve our pipeline performance.

I am not able to upload a screenshot of my pipeline. It says ">10 points required to upload files".



I upvoted your question - you should be able to edit it and add a picture now

metadaddy ( 2020-01-31 11:28:26 -0500 )

I tried to add a screenshot, but it's showing the same message: ">10 points required to upload files".

RaviPrakash ( 2020-02-03 23:41:10 -0500 )

Ah - it's because you posted the question anonymously, so you don't earn reputation when people upvote it. I just gave you some reputation manually

metadaddy ( 2020-02-03 23:42:53 -0500 )

Could you please share your view on my post?

RaviPrakash ( 2020-02-03 23:49:05 -0500 )

1 Answer


answered 2020-02-04 10:26:19 -0500

metadaddy

The most promising area for improvement here is the JavaScript evaluator. Here are some suggestions:

  • If possible, run a test without the evaluator just to see what the potential benefits might be.
  • The Groovy evaluator often performs better than the JavaScript equivalent, so it might be worthwhile porting your logic to Groovy.
  • Setting the Record Type advanced property to Data Collector Records avoids some marshaling of data between Data Collector's Java format and script objects, at the expense of a more verbose syntax for accessing record data. See the docs on Accessing Record Details.
  • If you can process the parameters outside the pipeline and set Runtime Parameters, you can avoid the need for an evaluator entirely. You can set Runtime Parameters when you start the pipeline via the CLI or REST API.
  • If you decide that you really need your logic to execute at pipeline runtime, you can code a custom processor in Java. See the tutorial for an example.
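
Whichever evaluator or processor ends up holding the filter logic, the filter itself should be cheap per record. Here is a minimal sketch in plain Python, not Data Collector's scripting API, of loading the parametric data once into a set and then checking each record against it. The function names, the `cost_center` field, and the sample values are all hypothetical, since the question does not show the actual script.

```python
import io


def load_valid_cost_centers(lines):
    """Build the lookup set once, e.g. when a new data file arrives.

    A set gives constant-time membership checks on average, so the
    per-record cost does not grow with the size of the parametric file.
    """
    return frozenset(line.strip() for line in lines if line.strip())


def filter_batch(records, valid_cost_centers, field="cost_center"):
    """Keep only records whose field value is in the valid set."""
    return [r for r in records if r.get(field) in valid_cost_centers]


# Hypothetical parametric file contents, one cost center per line.
parametric_file = io.StringIO("CC100\nCC200\n\nCC300\n")
valid = load_valid_cost_centers(parametric_file)

# Hypothetical batch of records.
batch = [
    {"id": 1, "cost_center": "CC100"},
    {"id": 2, "cost_center": "CC999"},
    {"id": 3, "cost_center": "CC300"},
    {"id": 4},  # no cost center at all, so it is filtered out
]
kept = filter_batch(batch, valid)
print([r["id"] for r in kept])  # prints [1, 3]
```

If your current script scans a list or re-reads the parametric file for each record, switching to a structure like this may help, but measure before and after to confirm.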

Outside the evaluator, there is a time/space tradeoff in the batch size. You can edit the Data Collector configuration to increase the maximum batch size, then increase the batch size in your pipeline. Be aware, though, that the entire batch is held in memory, so you'll need to watch for out-of-memory errors and increase Data Collector's heap size if necessary.
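
To get a feel for that tradeoff, here is a back-of-the-envelope sketch in Python. The record count and average record size come from the figures in the question (about 1 GB for 10 million records, so roughly 100 bytes each on disk). The `overhead_factor` is purely an assumed placeholder for how much larger a record is in memory than on disk, so replace it with a number you have measured.

```python
import math


def batch_plan(total_records, batch_size, avg_record_bytes, overhead_factor):
    """Return (number of batches, estimated bytes held by one batch)."""
    batches = math.ceil(total_records / batch_size)
    batch_bytes = batch_size * avg_record_bytes * overhead_factor
    return batches, batch_bytes


total_records = 10_000_000  # upper end of the range in the question
avg_record_bytes = 100      # ~1 GB / 10 million records
overhead_factor = 10        # ASSUMPTION: in-memory size vs. on-disk size

for batch_size in (1_000, 10_000, 50_000):
    batches, batch_bytes = batch_plan(
        total_records, batch_size, avg_record_bytes, overhead_factor
    )
    print(
        f"batch_size={batch_size:>6}: {batches:>6} batches, "
        f"~{batch_bytes / 1e6:.0f} MB per in-flight batch"
    )
```

Larger batches mean fewer batches and less per-batch overhead, at the cost of more memory held at once. With several pipelines sharing one VM, the per-batch figure applies to every batch in flight.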

Note - if you are a StreamSets customer, please reach out to your customer success engineer. They'll be happy to review these options (and perhaps others) with you, and work with you to optimize your pipelines.



Thank you so much for your suggestion.

RaviPrakash ( 2020-02-04 21:32:06 -0500 )