
What can be done to improve performance in SDC?

asked 2018-02-27 03:35:53 -0500

grace

updated 2018-02-28 13:56:22 -0500

metadaddy

Data ingestion in StreamSets is very slow compared to NiFi. Ingesting a 1.8 GB file from a directory to HDFS took 8.27 minutes in StreamSets, while in NiFi it took 83 seconds. What could be done to improve performance in StreamSets?
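For reference, the gap can be expressed as throughput. The sketch below is only arithmetic on the figures quoted above; it assumes that 1.8 GB means 1800 MB and that "8.27 minutes" is a decimal number of minutes (both are assumptions, not stated in the question). Under those assumptions StreamSets runs at roughly 3.6 MB/s against roughly 21.7 MB/s for NiFi, about a 6x difference.

```python
# Figures from the question; the unit interpretations are assumptions.
FILE_SIZE_MB = 1.8 * 1000        # assumes 1.8 GB means 1800 MB (decimal units)
STREAMSETS_SECONDS = 8.27 * 60   # assumes "8.27 minutes" is decimal minutes
NIFI_SECONDS = 83.0


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput in MB/s for moving size_mb in the given time."""
    return size_mb / seconds


streamsets_rate = throughput_mb_per_s(FILE_SIZE_MB, STREAMSETS_SECONDS)
nifi_rate = throughput_mb_per_s(FILE_SIZE_MB, NIFI_SECONDS)
slowdown = STREAMSETS_SECONDS / NIFI_SECONDS

print(f"StreamSets: {streamsets_rate:.1f} MB/s")
print(f"NiFi:       {nifi_rate:.1f} MB/s")
print(f"StreamSets took {slowdown:.1f}x as long")
```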

Configurations tried in StreamSets

  1. Batch size: 1000, 10000, 10000000
  2. Number of threads: 1, 5, 10
  3. Batch wait time: 1 sec, 10 sec, 60 sec
  4. Buffer limit: 128, 1024

More info:

  1. The data format that we are consuming is Delimited (CSV).
  2. We are not using any processors.
  3. The destination is Hadoop FS (stage library CDH 5.9.2) with the Delimited (CSV) data format; all other configurations are default.
  4. Only the input Directory (origin) and the destination are in the pipeline, so the numbers of processed inputs and outputs are always the same, and we couldn't see any stage-specific information in the pipeline metrics.

Could you please suggest a way to track stage-wise processing time?


1 Answer


answered 2018-02-27 13:52:45 -0500

jeff

updated 2018-02-28 13:56:59 -0500

metadaddy

There is some basic information on performance in the documentation here. Can you have a read through that? Also, a lot more information is needed to form a cogent answer. What is the data format you are consuming from the directory? What processors, if any, are involved and how are they configured? What is the destination and how is it configured? Have you checked the pipeline metrics while the pipeline is running to see which stages take what share of the total processing time?

When the pipeline is running, look for the _Stage Batch Processing Timer (in seconds)_. It is a pie chart on the Monitoring/Summary tab.

My guess is that the parsing and generation of CSV are slowing you down. If you are not actually doing anything other than directly copying the records, why not just use TEXT as the format and send the lines straight through? That way you will not incur the parsing overhead.
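To illustrate the kind of overhead meant here, below is a minimal Python sketch. It is not StreamSets code and says nothing about how SDC is implemented internally; the function names and the sample data are hypothetical. It copies the same data two ways: once by parsing each line into fields and serializing them back to CSV, and once by passing each line through untouched. The output is identical for simple unquoted data, but the first path does extra work per record.

```python
import csv
import io
import time


def make_sample(rows: int, cols: int = 10) -> str:
    """Build hypothetical CSV data: `rows` identical lines of `cols` fields."""
    line = ",".join(f"field{i}" for i in range(cols))
    return "\n".join([line] * rows) + "\n"


def copy_as_delimited(src: str) -> str:
    """Parse every line into fields, then serialize the fields back to CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in csv.reader(io.StringIO(src)):
        writer.writerow(record)
    return out.getvalue()


def copy_as_text(src: str) -> str:
    """Pass each line through unchanged, with no field parsing."""
    out = io.StringIO()
    for line in io.StringIO(src):
        out.write(line)
    return out.getvalue()


sample = make_sample(100_000)

for name, copy in (("delimited", copy_as_delimited), ("text", copy_as_text)):
    start = time.perf_counter()
    result = copy(sample)
    elapsed = time.perf_counter() - start
    # Both paths must reproduce the input exactly for this simple data.
    print(f"{name:>9}: {elapsed:.3f} s, identical output: {result == sample}")
```

The absolute timings depend on the machine and say nothing about SDC itself; the point is only that a plain copy does not need to split and re-join fields.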

