Oracle CDC Origin is losing data

asked 2020-01-09 10:23:27 -0500

StreamNooby


We have a pipeline with an Oracle CDC origin that sends data to a Kafka topic. Besides this pipeline, we have another one that reads the data from the Kafka topic and writes it to a Kudu table.

The problem we are facing is that writes to the Kudu table fail with errors caused by missing records. Basically, UPDATE records arrive from the Kafka topic that can't be applied because the corresponding INSERT record was never written to the Kudu table.
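To confirm which rows are affected, one option is to scan the change records consumed from the Kafka topic and list the UPDATEs that have no earlier INSERT for the same key. This is only a minimal sketch: the record shape (dicts with "op" and "key" fields) and the function name are hypothetical, and it assumes every row in the scanned window should have been inserted within that same window.

```python
def find_orphan_updates(records):
    """Return keys of UPDATE records seen before any INSERT for the same key.

    `records` is an iterable of dicts with hypothetical fields:
      "op"  -> "INSERT", "UPDATE" or "DELETE"
      "key" -> primary key value of the changed row
    """
    seen_inserts = set()
    orphans = []
    for rec in records:
        op, key = rec["op"], rec["key"]
        if op == "INSERT":
            seen_inserts.add(key)
        elif op == "UPDATE" and key not in seen_inserts:
            # The UPDATE arrived but its INSERT never did.
            orphans.append(key)
    return orphans


if __name__ == "__main__":
    sample = [
        {"op": "INSERT", "key": 1},
        {"op": "UPDATE", "key": 1},
        {"op": "UPDATE", "key": 2},  # no INSERT for key 2 in this window
    ]
    print(find_orphan_updates(sample))
```

The keys it returns can then be looked up in the redo logs to see when the missing INSERTs were written.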

When checking the data in the Oracle redo logs, we noticed that the missing INSERT records have timestamps only seconds before the log switches to another file. My guess is that the "Oracle CDC Client" is not reading the records fast enough to pick up all of them before the log file changes.
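A rough way to test this guess is to take the timestamps of the missing INSERTs and the timestamps of the log switches and check how many of the missing records fall within a few seconds before a switch. The sketch below assumes both lists are already available as Python datetime values; the function name and the 10-second window are arbitrary choices for illustration, not values taken from our system.

```python
from datetime import datetime, timedelta


def records_near_switch(record_times, switch_times, window=timedelta(seconds=10)):
    """Return the record timestamps that fall within `window` BEFORE a log switch."""
    hits = []
    for rec in record_times:
        for sw in switch_times:
            # Keep the record if the switch happened at or after it,
            # but no later than `window` after it.
            if timedelta(0) <= sw - rec <= window:
                hits.append(rec)
                break
    return hits


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    switches = [datetime(2020, 1, 1, 12, 0, 0)]
    missing = [
        datetime(2020, 1, 1, 11, 59, 55),  # 5 s before the switch
        datetime(2020, 1, 1, 11, 30, 0),   # long before the switch
    ]
    print(records_near_switch(missing, switches))
```

If nearly all missing records are hits while records that arrived correctly are not, that would support the idea that the loss is tied to the log switch.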

By the way, the tables we are talking about have more than 300 columns.

Can anyone think of something that might be causing this, and how to avoid it?


Any idea of what might be causing this?

answered 2020-01-09 19:21:19 -0500

shixinbao

updated 2020-01-09 19:26:31 -0500

I also encountered this problem (data loss) with the PostgreSQL CDC Client. If you solve it, please let me know. At first I thought no WAL logs were being generated, so I compared Debezium with StreamSets. The result showed that Debezium captured all the data, while StreamSets lost some.

Hi, could you tell me in which scenarios CDC can lose data? Thanks.

Dean Han ( 2020-01-12 08:27:56 -0500 )

Hi Dean. The records we are losing always occur just prior to a redo log switch. For example, the record is inserted at 13:44 and the redo log switch occurred at "Sat Jan 04 13:43:53 2020 Thread 3 advanced to log sequence 117902 (LGWR switch)"
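For anyone who wants to line these up programmatically, here is a minimal sketch that pulls the timestamp, thread and log sequence out of a switch message. It assumes the message is on a single line exactly as quoted above (in an actual alert log the timestamp may sit on its own line), that the day and month names are in English, and the function name is just a placeholder.

```python
import re
from datetime import datetime

# Matches lines shaped like the one quoted above (assumed single-line format).
SWITCH_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"Thread (?P<thread>\d+) advanced to log sequence (?P<seq>\d+)"
)


def parse_log_switch(line):
    """Return (timestamp, thread, sequence), or None if the line does not match."""
    m = SWITCH_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
    return ts, int(m.group("thread")), int(m.group("seq"))


if __name__ == "__main__":
    line = ("Sat Jan 04 13:43:53 2020 Thread 3 advanced to "
            "log sequence 117902 (LGWR switch)")
    print(parse_log_switch(line))
```

The parsed switch times can then be compared against the commit timestamps of the missing records.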

StreamNooby ( 2020-01-14 06:04:52 -0500 )
