
Multitable CDC to Kafka in AVRO format not possible?

asked 2018-12-10 11:38:41 -0600 by pwel, updated 2018-12-13 01:28:58 -0600


No answer so far, hence I simplify my question:

  • I capture records of 100 tables in one CDC stage (possible with the CDC stages)
  • I need to write the records to 100 corresponding Kafka Topics (possible since I can derive the kafka topic from each record-header)
  • I need to use the Avro record type for the Kafka Topics (doesn't seem to be possible?)

It's a common task in CDC to capture schematized data from many tables, and it's also common to use Avro as the format for a Kafka destination. How can I do that with SDC?
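To illustrate the topic-routing step above: deriving the target topic per record from its header could look roughly like this sketch (the header attribute name `"table"` and the topic prefix are assumptions for illustration, not the actual SDC API - real CDC origins expose the source table under stage-specific header attributes):

```python
# Hypothetical sketch: derive the target Kafka topic from a CDC
# record's header attributes. The attribute name "table" is an
# assumption; adapt it to the header your CDC origin actually sets.
def topic_for(header_attributes, prefix="cdc."):
    table = header_attributes["table"]  # e.g. "CUSTOMERS"
    return prefix + table.lower()

# With 100 captured tables, each record is routed to its own topic,
# e.g. {"table": "CUSTOMERS"} -> "cdc.customers"
```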

If it is not possible, I'd like to suggest implementing something like this:

  • Allow deriving/defining the Schema ID / Subject dynamically per record
  • Load the Avro schema from the schema registry when accessing a record of a certain schema for the first time
  • Cache this schema within the pipeline (for a certain, configurable time - e.g. 60 minutes)
  • Use it with the corresponding record
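The steps above can be sketched as a minimal TTL cache (the registry lookup is stubbed out; the function and class names, and the 60-minute default, are assumptions for illustration only):

```python
import time

# Placeholder for a real Schema Registry lookup
# (e.g. GET /subjects/<subject>/versions/latest).
def fetch_schema_from_registry(subject):
    return {"type": "record", "name": subject, "fields": []}

class SchemaCache:
    """Per-pipeline cache of Avro schemas with a configurable TTL."""

    def __init__(self, ttl_seconds=3600):  # e.g. 60 minutes
        self.ttl = ttl_seconds
        self._cache = {}  # subject -> (schema, fetched_at)

    def get(self, subject):
        entry = self._cache.get(subject)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]  # cache hit: reuse the schema
        # First access for this subject, or the entry expired:
        # fetch from the registry and cache it.
        schema = fetch_schema_from_registry(subject)
        self._cache[subject] = (schema, time.time())
        return schema
```

Each record would then call `get()` with its dynamically derived subject, so only the first record per table (or the first after expiry) pays for a registry round trip.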

Thanks in advance and Regards, Peter


1 Answer


answered 2019-01-23 08:41:28 -0600 by pwel

Since there is no answer so far, I'd like to summarize our (still open) issue here as an answer:

  • We generally source all our data through Kafka (for various reasons such as unification, streaming capabilities, etc.) in various projects and industries
  • We must keep a small footprint with low overhead due to the amount of data we transfer via CDC to data warehouses and operational services - adding the Avro schema to each message is not an option: it produces 10+ times more data in Kafka, even when compressed
  • Our goal is to move that data via Kafka to other consumers such as databases, files or services without having to reconstruct the schema or types again when consuming - Avro with a Schema Registry would be just perfect here
  • Also, we cannot run 400 single StreamSets pipelines (memory requirements), each with a single Avro schema
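To make the footprint point concrete: with a schema registry, each message only needs a small fixed-size framing header instead of the full embedded schema. This sketch follows the Confluent wire format (magic byte 0x00 plus a 4-byte big-endian schema id); it is an illustration of the overhead argument, not SDC code:

```python
import struct

# Frame an Avro-encoded payload with the Confluent wire format:
# 1 magic byte (0x00) + 4-byte big-endian schema id + Avro body.
# The consumer resolves the schema from the registry by id, so the
# per-message overhead is 5 bytes regardless of schema size.
def framed_payload(schema_id, avro_bytes):
    return struct.pack(">bI", 0, schema_id) + avro_bytes

msg = framed_payload(42, b"\x02\x06foo")  # dummy Avro body
```

Compare that 5-byte header with embedding a multi-kilobyte schema document in every message, and the "10+ times more data" figure for small CDC records is easy to see.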

I have also not found any alternative solution with StreamSets that meets our needs so far. Unfortunately, this blocks the use of StreamSets for medium and large data integration projects for us. It's also not possible to fix it ourselves by just cloning and adjusting a stage, since it seems to be an SDC-inherent feature.

For us, this is critical and blocks production usage, and I would be very happy to see a solution with StreamSets.

Best regards, Peter

