More efficient initial load though JDBC possible?

asked 2019-01-16 13:30:50 -0600

pwel gravatar image

When using the JDBCMultitable origin for initially loading an Oracle table over multiple processes using partitioning, streamsets defaults to the table's primary key for deriving the partitions. This works fine and fast for an evenly spreaded ID-column with few gaps and then splitting the table into a couple of small to medium sized partitions looping through index ranges.

But I have very large tables (> 2.5 billion records) not fitting into memory and with a bad clustering-factor between the index and the table (=> you need to physically read table data from disk when looping through the index for each single index entry) and hence the throughput goes down to below 300 records per second in total. So I need a non-nested-loop access to the data to speed it up.

I know I can replace the offset columns with other columns that are not index-bound. But there are none with evently distributed data in my tables. To avoid this, I created a view on top of the table and added a calculated column with a function like "MOD(PK,8) as mymodcol" on the PK column, which distributed the data well and I can define it as a replacement for the PK in streamsets.

For testing, I ran a simple set of queries inside some manually written plsql block (like "select * from mytable where mymodcol >= 0 and mymodcol <1", "select * from mytable where mymodcol >= 1 and mymodcol <2" etc.). It shows 8 full-scans of the table data - one per query. This is good - due to a 500 MB/s disk throughput it reaches around 2.5 million records per second scan-rate even if all data is on disk. Since I split the data into 8 partitions, each one reading the full set of data, I can access appr. 300000 matching records per second in total spread into 8 streams. Potentially almost 1000 times faster than the 300/s records via index access - even I scan the same table 8 times redundantly.

But now, an weird effect takes place. Even if I explicitely add an initial offset to the origin in streamsets and define the partition size to 1, it runs the initial query "select min(mymodcol) from mytable" once per process, wich just adds 8 more full table scans without need. And in addition, I definitely do not need any ORDER BY clause for initial loads. They produce immense additional load and resource consumption on Oracle. If an initial load fails, I like to start it again completely instead of waiting.

Derived from this, I suggest adding the following features to the JDBC Multitable origin:

  • An option for supressing the OrderBy clause per table
  • An option for adding a comment to SQL (useful to add hints in oracle) per table to avoid creating additional views for supressing index access
  • The possibility to also use expressions instead of plain columns as offset columns to avoid creating additional views
  • The possibility to switch off the initial "select min" queries since they multiply ...
(more)
edit retag flag offensive close merge delete

Comments

Hi Peter! Thanks for the taking the time to describe the scenario/use case in such great detail and for your suggestions. Do you mind creating a JIRA ticket for this -- https://issues.streamsets.com? Much appreciated. Thanks again! Cheers, Dash

iamontheinet gravatar imageiamontheinet ( 2019-01-16 13:41:42 -0600 )edit

Sure, thanks.

pwel gravatar imagepwel ( 2019-01-17 02:02:19 -0600 )edit