What is the four-pillar model of data observability?

Volume, Freshness, Schema, Distribution

Your `daily_active_users` metric drops 40% overnight. The volume, freshness, and schema checks are all green. Which pillar do you investigate next?

Distribution: a value-level regression such as a timezone bug, a country misclassification, or a bot-filter change.
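One way to start that distribution investigation is to break the metric down by segment and see which segment moved. The sketch below is a minimal illustration, not a prescribed method: the `country` field, the row layout, and the sample counts are all hypothetical.

```python
from collections import Counter

def segment_changes(before_rows, after_rows, key):
    """Relative change in row count per segment between two snapshots.

    Only segments present in `before_rows` are reported; segments that
    appear for the first time in `after_rows` are ignored in this sketch.
    """
    before = Counter(row[key] for row in before_rows)
    after = Counter(row[key] for row in after_rows)
    return {
        segment: (after[segment] - count) / count
        for segment, count in before.items()
    }

# Hypothetical active-user rows for two consecutive days (100 rows, then 60).
yesterday = [{"country": "US"}] * 40 + [{"country": "DE"}] * 40 + [{"country": "JP"}] * 20
today = [{"country": "US"}] * 40 + [{"country": "JP"}] * 20

changes = segment_changes(yesterday, today, "country")
worst = min(changes, key=changes.get)  # segment with the largest relative drop
print(worst, changes[worst])
```

Here the overall count falls by 40%, but the whole drop sits in one segment, which points at a value-level cause such as a misclassified country rather than a general loss of traffic.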

Data Observability & Anomaly Detection — Semantic Web Academy

Overview

Data observability means running volume, freshness, schema-drift, and distribution checks on datasets, monitoring them the way SRE teams monitor services.

Why it matters

Data observability is the operational layer that catches a silent break before it surfaces in a dashboard.

Going deeper

The four pillars, with the most common silent failure each catches:

Pillar	Catches	Typical alert
Freshness	Upstream job hung / cron skipped	`max(updated_at) < now() - interval '1 hour'`
Volume	Filter regression dropping 90% of rows	row count ± 3σ of 7-day rolling mean
Schema	Column rename, type change, dropped field	hash of the table's `information_schema.columns` rows changed
Distribution	Locale bug filling a column with NULLs / nonsense	null-rate, cardinality, mean/p99 outside band
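The freshness and volume alerts in the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the one-hour lag, the fixed timestamps, and the sample row counts are hypothetical, and a real check would read `max(updated_at)` and the daily counts from the warehouse.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def is_stale(latest_update, now, max_lag=timedelta(hours=1)):
    """Freshness: True if the newest row is older than `max_lag`."""
    return now - latest_update > max_lag

def volume_out_of_band(history, today_count, sigmas=3.0):
    """Volume: True if today's count is more than `sigmas` sample
    standard deviations away from the mean of `history`."""
    mu = mean(history)
    sd = stdev(history)  # needs at least two historical points
    return abs(today_count - mu) > sigmas * sd

# Hypothetical clock and latest-update timestamps.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
updated_recently = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)
updated_long_ago = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)

# Hypothetical daily row counts for the previous seven days.
last_7_days = [10_120, 9_980, 10_050, 10_200, 9_900, 10_010, 10_090]

print(is_stale(updated_recently, now), is_stale(updated_long_ago, now))
print(volume_out_of_band(last_7_days, 10_040), volume_out_of_band(last_7_days, 1_000))
```

A fixed 3σ band over seven points is deliberately crude. Tables with strong weekday seasonality would likely need a per-weekday baseline to avoid noisy alerts.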

All four can be implemented as cheap SQL or as a managed product (Monte Carlo, Bigeye, Soda, Datafold, Elementary). The decision isn't whether to instrument; it's who maintains the rules and how alert fatigue is kept under control.
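As an illustration of the "cheap SQL" route, the schema check can be approximated by hashing a table's column names and types and comparing the hash against a stored baseline. The sketch below uses an in-memory SQLite database so it runs anywhere. SQLite has no `information_schema`, so `PRAGMA table_info` stands in for it here, and the `events` table and the `schema_fingerprint` helper are hypothetical.

```python
import hashlib
import sqlite3

def schema_fingerprint(conn, table):
    """Hash the (name, type) pairs of a table's columns, in column order.

    `table` is interpolated into the statement, so it must be a trusted
    identifier, never user input.
    """
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    described = "|".join(f"{row[1]}:{row[2]}" for row in rows)
    return hashlib.sha256(described.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, country TEXT, updated_at TEXT)")
baseline = schema_fingerprint(conn, "events")
unchanged = schema_fingerprint(conn, "events") == baseline

# Simulate an upstream deploy that renames `country` to `country_code`.
conn.execute("DROP TABLE events")
conn.execute("CREATE TABLE events (id INTEGER, country_code TEXT, updated_at TEXT)")
changed = schema_fingerprint(conn, "events") != baseline

print(unchanged, changed)
```

A hash only says that something changed. Storing the column list alongside it would let the alert state which column was renamed, retyped, or dropped.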

Analogy

Data observability is SRE for datasets.

For services, SRE teams instrument the four golden signals: latency, traffic, errors, saturation. Without them, you only learn about an outage when the CEO's phone won't load Twitter.

Datasets need the same hygiene — but mapped to their failure modes. The job finished on time (freshness), the row count is in band (volume), the columns are still the columns you signed up for (schema), and the values still look like the values you signed up for (distribution). When one drifts, you page before the dashboard quietly turns into a story you'll have to retract on Monday.
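The last of those checks, whether the values still look like the values you signed up for, can be sketched with two simple column statistics: null rate and cardinality. This is a minimal sketch under stated assumptions: the sample columns, the tolerance, and the helper names are hypothetical.

```python
def null_rate(values):
    """Fraction of entries in a column that are None."""
    if not values:
        raise ValueError("cannot compute a null rate for an empty column")
    return sum(v is None for v in values) / len(values)

def cardinality(values):
    """Number of distinct non-null values in a column."""
    return len({v for v in values if v is not None})

def outside_band(observed, expected, tolerance):
    """True if `observed` differs from `expected` by more than `tolerance`."""
    return abs(observed - expected) > tolerance

# Hypothetical `country` column on a normal day and after a locale bug.
baseline_day = ["US", "DE", "FR", None, "US", "JP", "BR", "DE", "US", "IN"]
broken_day = ["US", None, None, None, "US", None, None, None, "US", None]

null_alert = outside_band(null_rate(broken_day), null_rate(baseline_day), tolerance=0.05)
cardinality_alert = cardinality(broken_day) < cardinality(baseline_day) / 2

print(null_alert, cardinality_alert)
```

The row count is the same on both days, so a volume check would pass. Only the value-level statistics show that the column has degraded.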

Make it stick

Use the prompts below to anchor data observability & anomaly detection to something you actually own.

›List the three datasets your team relies on most. For each, which of the four pillars is *currently unmonitored*?
›What is your team's mean time to detect a silent data break today? Estimate how it would change with the four pillars instrumented.
›Where is the line between 'useful alert' and 'noise that gets muted'? What guardrails would keep data-quality alerts above that line?
