Change Data Capture (CDC) — the OLTP→OLAP Bridge

Debezium, log-based vs query-based, why CDC ate nightly dumps.

0/2 done

Theory

Read the WAL, not the tables

Nightly SELECT * FROM orders dumps are the dark age. They miss deletes, miss intra-day changes, hammer the OLTP DB, and scale linearly with table size.

Log-based CDC reads the database's write-ahead log (Postgres WAL, MySQL binlog, SQL Server CDC tables) and publishes every row change as an event — typically into Kafka via Debezium. The result:

  • Captures inserts, updates and deletes.
  • Near-zero load on the OLTP DB (read from the log, not the tables).
  • Naturally event-driven: every downstream system gets the same stream.

Query-based CDC (poll WHERE updated_at > last_seen) still has its place for sources without log access, but it cannot see deletes and cannot guarantee no missed changes under concurrent writes.

Analogy

Query-based CDC is checking your bank balance once a day and guessing what happened: if it's the same at 9am and 9pm you assume nothing moved — but you'd never see the deposit and withdrawal that cancelled out, and you'd miss a closed account entirely. Log-based CDC is reading the itemised statement instead: every single transaction, in order, including the ones that net to zero and the account closures (deletes). And you read it from the bank's own ledger (the write-ahead log) without ever bothering the teller — so the production database barely notices you're watching.

Reflect

CDC promises tidy event streams from messy OLTP databases — but you inherit the schema decisions of whoever built the source table years ago. Plan the transformation tier (stream→table normalisation, type casting, PII handling) before flipping the switch.

  • Which OLTP databases in your org are the highest-value CDC candidates?
  • Who owns the replication-slot monitoring — the DBA team, or your DE team?

Reading in progress · 0 of 2 activities done