Data Observability & Anomaly Detection

Volume / freshness / schema drift / distribution checks — monitoring for datasets the way SRE monitors services.

Why it matters

Data observability is the operational layer that catches a silent data break before it surfaces as a wrong number on a dashboard.

Going deeper

The four pillars, with the most common silent failure each catches:

| Pillar | Catches | Typical alert |
| --- | --- | --- |
| Freshness | Upstream job hung / cron skipped | max(updated_at) < now() - 1h |
| Volume | Filter regression dropping 90% of rows | row count ± 3σ of 7-day rolling mean |
| Schema | Column rename, type change, dropped field | hash of information_schema for the table changed |
| Distribution | Locale bug filling a column with NULLs / nonsense | null-rate, cardinality, mean/p99 outside band |
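As an illustration of the volume row, here is a minimal Python sketch of a "± 3σ of the 7-day rolling mean" check. The function name `volume_alert`, its parameters, and the sample counts are hypothetical, and the sample standard deviation is one reasonable choice among several; a real check would read daily counts from the warehouse.

```python
from statistics import mean, stdev


def volume_alert(daily_counts, today_count, window=7, n_sigma=3.0):
    """Return True when today's row count is outside mean ± n_sigma * stdev
    of the trailing `window` days (hypothetical helper, not a library API)."""
    history = daily_counts[-window:]
    if len(history) < 2:
        return False  # assumption: stay quiet until there is enough history
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of the window
    return abs(today_count - mu) > n_sigma * sigma


# Made-up daily row counts for the previous seven days.
history = [10_120, 9_980, 10_050, 10_200, 9_900, 10_010, 10_090]

dropped = volume_alert(history, 1_000)   # a filter regression removed ~90% of rows
normal = volume_alert(history, 10_060)   # an ordinary day
print(dropped, normal)
```

With a perfectly constant history the standard deviation is zero, so any deviation at all fires the alert; a floor on `sigma` or a minimum relative change is a common guard against that kind of noise.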

All four can be implemented as cheap SQL or as a managed product (Monte Carlo, Bigeye, Soda, Datafold, Elementary). The decision isn't whether to instrument; it's who maintains the rules and how alert fatigue is kept under control.
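To make "cheap SQL" concrete, the sketch below runs freshness, schema, and distribution checks against an in-memory SQLite database from Python. Everything in it is an assumption for illustration: the `orders` table, its columns, the helper names, and the thresholds. SQLite has no `information_schema`, so `PRAGMA table_info` stands in for it here; on a warehouse the same idea would query `information_schema.columns`.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta, timezone


def is_stale(conn, table, ts_column, max_age, now=None):
    """Freshness: True when the newest timestamp is older than max_age."""
    now = now or datetime.now(timezone.utc)
    # table/column names are trusted identifiers here, never user input.
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if latest is None:
        return True  # assumption: an empty table counts as stale
    return now - datetime.fromisoformat(latest) > max_age


def schema_fingerprint(conn, table):
    """Schema: SHA-256 over the (column name, declared type) pairs."""
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    payload = "|".join(f"{name}:{ctype}" for _, name, ctype, *_ in cols)
    return hashlib.sha256(payload.encode()).hexdigest()


def null_rate(conn, table, column):
    """Distribution: fraction of rows where the column IS NULL."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM({column} IS NULL) FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0


# Hypothetical table with timezone-aware ISO-8601 timestamps stored as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, locale TEXT, updated_at TEXT)")
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "en_US", (now - timedelta(hours=3)).isoformat()),
        (2, None, (now - timedelta(hours=2)).isoformat()),
    ],
)

stale = is_stale(conn, "orders", "updated_at", timedelta(hours=1), now=now)
before = schema_fingerprint(conn, "orders")
conn.execute("ALTER TABLE orders ADD COLUMN discount REAL")  # simulated drift
after = schema_fingerprint(conn, "orders")
locale_nulls = null_rate(conn, "orders", "locale")
print(stale, before != after, locale_nulls)
```

Storing the previous fingerprint and comparing it on each run turns the schema check into an alert; any change to column names or declared types produces a different hash. Who owns the thresholds (`max_age`, the acceptable null rate) is the maintenance question raised above.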

Analogy

Data observability is SRE for datasets.

For services, SRE teams instrument the four golden signals: latency, traffic, errors, saturation. Without them, you only learn about an outage when the CEO's phone won't load Twitter.

Datasets need the same hygiene — but mapped to their failure modes. The job finished on time (freshness), the row count is in band (volume), the columns are still the columns you signed up for (schema), and the values still look like the values you signed up for (distribution). When one drifts, you page before the dashboard quietly turns into a story you'll have to retract on Monday.

Make it stick

Use the prompts below to anchor data observability & anomaly detection to something you actually own.

  • List the three datasets your team relies on most. For each, which of the four pillars is *currently unmonitored*?
  • What's your team's mean-time-to-detect for a silent data break today? Estimate how it would change with the four pillars instrumented.
  • Where's the line between 'useful alert' and 'noise that gets muted'? What guardrails would keep data-quality (DQ) alerts above that line?
