Slow ingest from PostgreSQL CDC origin

asked 2019-06-09 21:59:07 -0500

Avi

updated 2019-06-10 09:13:32 -0500

metadaddy

I am generating load on a PostgreSQL database and replicating it to Hadoop using CDC. One minute of load (40K records) takes about 60 minutes to replicate to Hadoop. The real-time statistics show the batch input/output size growing in increments of 100.
In the sdc.properties file: production.maxBatchSize=1000

When I increased production.maxBatchSize to a higher value (5000), it still processed batches of 100 records.

How can I tune StreamSets performance? Should I set the batch size to a higher value, and if so, how?


1 Answer

answered 2019-06-10 09:12:41 -0500

metadaddy

updated 2019-06-11 12:36:26 -0500

You need to set the batch size in the PostgreSQL CDC origin as well as production.maxBatchSize. In the origin configuration, go to the JDBC tab and set Max Batch Size (records) to a higher value.

We actually just fixed SDC-11757 in this area, so it's possible you're seeing that. The fix will be in StreamSets Data Collector versions 3.9.1 and 3.10.0.
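
To make the interaction between the two settings concrete, here is a minimal Python sketch. It assumes that the effective batch size is the smaller of the origin's Max Batch Size (records) and production.maxBatchSize, and it uses the figures reported in this thread (40K records, batches of 100 roughly every 10 seconds). The function names are hypothetical and are not part of any StreamSets API. The estimate with the larger batch size assumes the interval per batch stays the same, which this thread does not establish.

```python
def effective_batch_size(origin_max_batch_size: int, production_max_batch_size: int) -> int:
    """Assumed rule: the smaller of the two limits determines the batch size."""
    return min(origin_max_batch_size, production_max_batch_size)


def minutes_to_drain(total_records: int, batch_size: int, seconds_per_batch: float) -> float:
    """Estimate how long a backlog takes if one batch completes every seconds_per_batch."""
    batches = -(-total_records // batch_size)  # ceiling division
    return batches * seconds_per_batch / 60.0


# Observed case: batches of 100 even though sdc.properties allows 5000.
observed = effective_batch_size(100, 5000)
print(observed)                                            # 100
print(round(minutes_to_drain(40_000, observed, 10.0), 1))  # 66.7

# Hypothetical case: both limits raised to 5000, same 10 s per batch (an assumption).
raised = effective_batch_size(5000, 5000)
print(round(minutes_to_drain(40_000, raised, 10.0), 1))    # 1.3
```

The first estimate, about 67 minutes, is in line with the roughly 60 minutes reported in the question, which suggests the batch size of 100 rather than the destination is the limiting factor.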


Comments

I set "Max Batch Size (records)" to 5000, and it is still processing batches of 100 records every 10 seconds. Please advise.

Avi (2019-06-11 12:59:33 -0500)

See the second paragraph that I added to my answer ^^^

metadaddy (2019-06-11 13:02:06 -0500)

Thank you, Pat! I will test it on 3.10.0.

Avi (2019-06-11 15:14:33 -0500)

Do you have release dates for versions 3.9.1 and 3.10.0?

Avi (2019-06-11 15:28:28 -0500)