Idempotency, Backfills and the Reproducibility Contract

The single property that decides whether you sleep well on call.


Theory

'Run me again' should be boring

An idempotent pipeline produces the same output regardless of how many times you run it for the same logical input (usually a date partition). When that's true:

  • A failed run can be retried without dedupe gymnastics.
  • A bug fix can be backfilled over months of partitions in parallel.
  • Two ops engineers can rerun the same window without fear.

How to get there:

  1. Parametrise on the partition ({{ ds }} in Airflow, partition_key in Dagster) — never on now().
  2. Replace, don't append. INSERT OVERWRITE the target partition, or use MERGE keyed on the natural key.
  3. Never read from sources that mutate without versioning. If the source is mutable, snapshot it into a raw zone first and read from the snapshot.

Many production data bugs trace back to a step that broke one of these three rules: a query that reads now(), a load that appends, or a read from a source that changed between runs.
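
To make rules 1 and 2 concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no INSERT OVERWRITE, so a DELETE plus INSERT inside one transaction stands in for it. The table name daily_orders, its columns, and the function load_partition are hypothetical, chosen only for illustration.

```python
import sqlite3


def load_partition(conn, ds, rows):
    """Replace every row of partition `ds` with `rows` in one transaction."""
    with conn:  # commits on success, rolls back on any exception
        # Clear the target partition first, so a rerun cannot duplicate rows.
        conn.execute("DELETE FROM daily_orders WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_orders (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (ds TEXT, order_id TEXT, amount REAL)")

rows = [("o-1", 10.0), ("o-2", 25.5)]
for _ in range(3):  # simulate a retry and a backfill rerun
    load_partition(conn, "2024-01-01", rows)

count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
print(count)  # 2
```

The partition date is passed in as an argument and never derived from the clock, so rerunning for 2024-01-01 next month touches exactly the same rows.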

Analogy

An idempotent pipeline is a light switch: set it to 'on' five times and the room is still just on; setting it again changes nothing. A non-idempotent pipeline is a doorbell that adds a guest to the party each time it rings: retry it after a hiccup and suddenly you've invited the same person three times (duplicate rows). The whole craft of safe backfills is making every pipeline a switch, not a doorbell, so 'just run it again' is a shrug, not a 2am incident.
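
The switch and the doorbell can be shown in a few lines of plain Python. This is a toy sketch, not a real loader: the list stands in for an append-only table, the dict for a table merged on its natural key, and the names append_load, upsert_load and order_id are made up for the example.

```python
def append_load(table, rows):
    """Doorbell: every call adds the rows again."""
    table.extend(rows)


def upsert_load(table, rows):
    """Switch: every call sets the row stored under each natural key."""
    for row in rows:
        table[row["order_id"]] = row


batch = [{"order_id": "o-1", "amount": 10.0}]

doorbell = []
switch = {}
for _ in range(3):  # the same batch delivered three times
    append_load(doorbell, batch)
    upsert_load(switch, batch)

print(len(doorbell), len(switch))  # 3 1
```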

Reflect

Idempotency is the single highest-leverage property in a data platform. Teams that enforce it sleep through on-call rotations; teams that skip it end up with runbooks full of manual dedupe steps and 'just delete the bad rows' phone calls.

  • Which of your prod pipelines would survive being rerun for the same partition right now?
  • What's the smallest enforcement you could add — a CI lint that fails on `NOW()` in a model, perhaps?
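
As a starting point for that lint, here is a rough sketch of a check that flags wall-clock calls in a SQL model. The deny-list of function names is an assumption and will differ by dialect, and the regex is deliberately naive: it does not skip comments or string literals. The function name find_wall_clock_calls and the sample model are hypothetical.

```python
import re

# Hypothetical deny-list; extend it for your SQL dialect.
NONDETERMINISTIC = re.compile(
    r"\b(NOW\s*\(|GETDATE\s*\(|CURRENT_TIMESTAMP\b|CURRENT_DATE\b)",
    re.IGNORECASE,
)


def find_wall_clock_calls(sql):
    """Return (line_number, line) pairs that reference the wall clock."""
    return [
        (number, line.strip())
        for number, line in enumerate(sql.splitlines(), start=1)
        if NONDETERMINISTIC.search(line)
    ]


model = """
SELECT order_id, amount
FROM raw_orders
WHERE created_at < NOW()
"""

violations = find_wall_clock_calls(model)
print(violations)  # [(4, 'WHERE created_at < NOW()')]
```

In CI you would run this over every model file and exit non-zero when the list is non-empty, so the build fails before the model ships.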
