Lineage — Knowing What Breaks When You Change That Column

Column-level lineage, OpenLineage, impact analysis as a daily tool.

Theory

Lineage is the platform's memory

Lineage records which upstream sources produced each table or column and which downstream consumers depend on it. The mature use cases:

  • Impact analysis before a schema change: 'which dashboards, ML features, reverse-ETL syncs read this column?'
  • Incident triage when a metric looks wrong: walk the lineage upstream until you find the broken hop.
  • Compliance: prove that a PII field never reached an uncontrolled downstream.
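All three use cases above reduce to walking the same graph in one direction or the other. The sketch below is a minimal illustration using only the standard library; the column names, the EDGES mapping and the APPROVED set are hypothetical, and a real platform would query its lineage backend rather than a hard-coded dict.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> columns derived from it.
EDGES = {
    "raw.customers.email": ["staging.customers.email"],
    "staging.customers.email": [
        "marts.dim_customer.email",
        "ml.churn_features.email_domain",
    ],
    "marts.dim_customer.email": [
        "dashboard.retention.contact_email",
        "reverse_etl.crm_sync.email",
    ],
}


def reachable(start, edges):
    """Breadth-first walk: every node reachable from start (start excluded)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Flip every edge so the same walk can travel upstream."""
    flipped = {}
    for src, targets in edges.items():
        for dst in targets:
            flipped.setdefault(dst, []).append(src)
    return flipped


# Impact analysis: what reads this column, directly or transitively?
impacted = reachable("staging.customers.email", EDGES)

# Incident triage: which upstream hops could have broken this dashboard field?
suspects = reachable("dashboard.retention.contact_email", invert(EDGES))

# Compliance: did the PII field reach anything outside the approved set?
APPROVED = {  # hypothetical allow-list of controlled destinations
    "staging.customers.email",
    "marts.dim_customer.email",
    "ml.churn_features.email_domain",
    "dashboard.retention.contact_email",
}
leaks = reachable("raw.customers.email", EDGES) - APPROVED

print(sorted(impacted))
print(sorted(suspects))
print(sorted(leaks))
```

With this toy graph, the compliance check flags the reverse-ETL sync as the only destination outside the allow-list, which is the kind of downstream that tends to go unnoticed without lineage.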

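OpenLineage is an open standard for lineage metadata. It grew out of the Marquez project, which serves as its reference implementation, and is now hosted by LF AI & Data. Integrations exist for Airflow, Dagster, dbt, Spark and Flink, so those tools can emit lineage events as their jobs run. Centralising those events in a Marquez / DataHub / OpenMetadata backend gives you the platform-wide graph with little hand-curation.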
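To make the event format concrete, here is a rough sketch of a single OpenLineage run event that carries column-level lineage, built as a plain dict with the standard library. The top-level keys and the columnLineage facet layout follow the OpenLineage specification as best understood here; the producer URI, the namespaces, the job and dataset names, and the exact spec versions in the schema URLs are assumptions to check against the published spec.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical producer URI identifying the code that emits the event.
PRODUCER = "https://example.com/etl/customer-pipeline"
# Spec versions in these URLs are assumptions; confirm against the current spec.
RUN_EVENT_SCHEMA = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"
COLUMN_LINEAGE_SCHEMA = (
    "https://openlineage.io/spec/facets/1-0-1/"
    "ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet"
)


def build_run_event(event_type, run_id):
    """Build one run event for a hypothetical job that writes marts.dim_customer."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": RUN_EVENT_SCHEMA,
        "run": {"runId": run_id},
        "job": {"namespace": "analytics", "name": "build_dim_customer"},
        "inputs": [{"namespace": "warehouse", "name": "staging.customers"}],
        "outputs": [
            {
                "namespace": "warehouse",
                "name": "marts.dim_customer",
                "facets": {
                    # Column-level lineage: output field -> the input fields it came from.
                    "columnLineage": {
                        "_producer": PRODUCER,
                        "_schemaURL": COLUMN_LINEAGE_SCHEMA,
                        "fields": {
                            "email": {
                                "inputFields": [
                                    {
                                        "namespace": "warehouse",
                                        "name": "staging.customers",
                                        "field": "email",
                                    }
                                ]
                            }
                        },
                    }
                },
            }
        ],
    }


event = build_run_event("COMPLETE", str(uuid.uuid4()))
payload = json.dumps(event, indent=2)
print(payload)
```

In practice the integrations listed above build and send events like this for you; hand-writing one is mainly useful for understanding what the backend stores, or for tools that have no integration yet.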

Analogy

Lineage is the 'show ingredients' chain on a food label crossed with a contact-tracing app. When a peanut allergy alert comes in, you need to know instantly which finished products contain that batch of peanuts and recall exactly those — not the whole shelf. Column-level lineage does the same for data: rename customers.email and the graph lights up every dashboard, ML feature and downstream sync that secretly depends on it, before you ship the change. Without it, you're recalling blind and finding the contaminated report only when an executive tastes it.

Column lineage — impact at a glance

A column-level lineage graph in action

The platform answers 'who reads customers.email?' and 'what breaks if we rename it?' before the PR ships.

Reflect

Lineage's value compounds: each new emitter (Airflow, dbt, Spark) extends the graph, and every added hop multiplies the paths the catalog can trace. Conversely, every blind spot, such as a Python ETL script that doesn't emit OpenLineage, silently breaks impact analysis for everything downstream of it.

  • Which tools in your stack don't emit OpenLineage yet — and what would it cost to add?
  • When was the last time a schema change broke a downstream nobody knew existed?
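As a rough sense of what closing one blind spot can cost, the sketch below wraps a plain Python ETL step in a context manager that records START, then COMPLETE or FAIL, events shaped like OpenLineage run events. Everything here is illustrative: lineage_run, emit and the SENT list are hypothetical names, the producer URI and schema version are assumptions, and a real script would send the events to its lineage backend instead of appending them to a list.

```python
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone

PRODUCER = "https://example.com/scripts/export_customers"  # hypothetical
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"  # version assumed

SENT = []  # stand-in sink for this sketch


def emit(event):
    """Record the event; a real script would send it to the lineage backend."""
    SENT.append(event)


@contextmanager
def lineage_run(job_name, inputs, outputs, namespace="scripts"):
    """Emit START on entry, then COMPLETE or FAIL depending on how the body exits."""
    run_id = str(uuid.uuid4())

    def event(event_type):
        return {
            "eventType": event_type,
            "eventTime": datetime.now(timezone.utc).isoformat(),
            "producer": PRODUCER,
            "schemaURL": SCHEMA_URL,
            "run": {"runId": run_id},
            "job": {"namespace": namespace, "name": job_name},
            "inputs": [{"namespace": namespace, "name": n} for n in inputs],
            "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        }

    emit(event("START"))
    try:
        yield run_id
    except Exception:
        emit(event("FAIL"))
        raise
    else:
        emit(event("COMPLETE"))


# Usage: the existing ETL logic goes inside the with-block unchanged.
with lineage_run("export_customers", ["marts.dim_customer"], ["files.customers_csv"]):
    pass  # read marts.dim_customer, write the CSV export

print([e["eventType"] for e in SENT])
```

This only gives table-level lineage for the script; column-level detail would still have to be declared by hand or derived from the script's queries.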
