
JDBC Multitable Consumer: slow data transfer to Hive

asked 2020-02-01 14:52:10 -0500

Torkia Boussada

Hello, yesterday we started ingesting data from a Postgres database. The pipeline origin is JDBC Multitable Consumer, and the batch size is the default value of 1000 records. The data transfer was fast for the first few hours, then it became slower and slower.

130 million records were transferred in one day. For the remaining 120 million, at the current rate, it will take 7 days to complete. Please see the attached screenshot for the record throughput: screenshot 1.png (/upfiles/15805900616563994.png)

Please advise!


1 Answer


answered 2020-02-02 06:40:39 -0500

Several factors determine the throughput of a pipeline. From the screenshot, I can see that the pipeline is generating idle batches, which means that some active threads are not seeing any data flow.
More information about the SDC Multitable Consumer can be found here: link text
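To put the slowdown in numbers, the figures from the question can be worked through with a short calculation. This is only a rough sketch: it assumes "one day" means a full 24 hours and that the 7-day estimate reflects the current steady rate. The function and variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_second(records: int, days: float) -> float:
    """Average throughput for `records` moved over `days` days."""
    return records / (days * SECONDS_PER_DAY)


# Figures taken from the question.
initial_rate = records_per_second(130_000_000, 1)  # first day
current_rate = records_per_second(120_000_000, 7)  # projected remainder

# How many times slower the pipeline is now compared with the first day.
slowdown = initial_rate / current_rate

print(f"initial:  {initial_rate:.0f} records/s")
print(f"current:  {current_rate:.0f} records/s")
print(f"slowdown: {slowdown:.1f}x")
```

Under these assumptions the rate drops from roughly 1,500 to roughly 200 records per second, which is about 7.6 times slower.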

Please verify that the Number of Threads property is set and that you are tracking offsets (incremental load). Each thread reads data from a single table, and each table can be read by at most one thread at a time.
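The one-thread-per-table rule above suggests one possible explanation for the slowdown: once the smaller tables are finished, only the threads that still have a table with data can do useful work, and the rest produce idle batches. The sketch below models just that rule. It is not SDC code, and the table names and row counts are hypothetical.

```python
from typing import Mapping


def active_threads(num_threads: int, remaining_rows: Mapping[str, int]) -> int:
    """Threads that can read data when each table allows at most one reader."""
    tables_with_data = sum(1 for rows in remaining_rows.values() if rows > 0)
    return min(num_threads, tables_with_data)


# Hypothetical state late in the run: one large table left, the rest drained.
remaining = {
    "orders": 120_000_000,
    "customers": 0,
    "products": 0,
    "invoices": 0,
}

threads = 4
busy = active_threads(threads, remaining)
idle = threads - busy

print(f"{busy} busy thread(s), {idle} idle thread(s)")
```

In this hypothetical state, adding more threads would not help, because the single remaining table can only be read by one of them.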

You can also increase the batch size to 10000 rows. Note that you might have to increase the JVM settings on the Data Collector as well, and that you might have to restart the pipeline for any changes to take effect.

You can also explore Spark- or Sqoop-based commands for that specific table if you want more control over the flow.




Seen: 211 times

Last updated: Feb 01 '20