Theory
'Run me again' should be boring
An idempotent pipeline produces the same output regardless of how many times you run it for the same logical input (usually a date partition). When that's true:
- A failed run can be retried without dedupe gymnastics.
- A bug fix can be backfilled over months of partitions in parallel.
- Two ops engineers can rerun the same window without fear.
How to get there:
- Parametrise on the partition (`{{ ds }}` in Airflow, `partition_key` in Dagster), never on `now()`.
- Replace, don't append. `INSERT OVERWRITE` the target partition, or use `MERGE` keyed on the natural key.
- Never read from sources that mutate without versioning. If the source is mutable, snapshot it first (raw zone).
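
As a rough illustration of the first two rules, here is a minimal sketch of a partition-scoped load. It uses an in-memory SQLite database as a stand-in for a real warehouse, and a delete-then-insert inside one transaction as a stand-in for `INSERT OVERWRITE`. The table `daily_sales` and the function `load_partition` are hypothetical names chosen for this example, and `ds` is assumed to be passed in by the scheduler rather than derived from the wall clock.

```python
import sqlite3


def load_partition(conn, ds, rows):
    """Replace the whole `ds` partition of daily_sales with `rows`.

    `ds` is the logical date supplied by the scheduler (hypothetical
    parameter name); nothing in here reads the current time.
    """
    # One transaction: the delete and the insert commit together,
    # or roll back together if anything raises.
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_sales (ds, store_id, amount) VALUES (?, ?, ?)",
            [(ds, store_id, amount) for store_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, store_id INTEGER, amount REAL)")

rows = [(1, 10.0), (2, 25.5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # a retry of the same partition
load_partition(conn, "2024-01-02", [(1, 7.0)])  # a different partition

count_day1 = conn.execute(
    "SELECT COUNT(*) FROM daily_sales WHERE ds = ?", ("2024-01-01",)
).fetchone()[0]
count_total = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

Because each run first clears the partition it is about to write, the retry leaves the `2024-01-01` partition with the same rows as a single run would, and loading `2024-01-02` does not touch it. An append-only version of the same function would have doubled the rows on retry.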
Most production data bugs trace back to a step that violated one of these three rules.