The Four Golden Signals — Adapted for ML

Latency, traffic, errors, saturation — plus quality.

SLI, SLO, error budget

SRE meets ML

Google's SRE book proposes four 'golden signals' for monitoring any production service: latency, traffic, errors, and saturation. For ML systems, add a fifth lane:

Quality: model performance measured against ground truth. Labels usually arrive late, so this signal lags the others, but it is essential.

For each signal, define a service level indicator (SLI, what we measure) and a service level objective (SLO, the target the SLI must meet). Examples:

SLI: 99th-percentile end-to-end inference latency. SLO: < 200 ms over 30 days.
SLI: % of requests where signature validation failed. SLO: < 0.1%.
SLI: weekly ROC-AUC on backfilled labels. SLO: ≥ baseline − 0.5 pp (percentage points, i.e. 0.005 on the 0 to 1 AUC scale).
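The three example SLI/SLO pairs above can be sketched as code. This is a minimal illustration, not a reference implementation: the `Request` record, the function names, and the nearest-rank percentile method are all assumptions introduced here, and the 0.5 pp quality margin is read as 0.005 on the 0 to 1 AUC scale. The 30-day and weekly windows are assumed to be applied by whoever selects the input data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One served inference request (hypothetical record layout)."""
    latency_ms: float
    signature_ok: bool


def p99_latency(latencies_ms: List[float]) -> float:
    """99th percentile using the nearest-rank method (an assumed convention)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n); the rank is 1-based.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def signature_failure_rate(requests: List[Request]) -> float:
    """Fraction of requests whose signature validation failed."""
    if not requests:
        raise ValueError("need at least one request")
    failed = sum(1 for r in requests if not r.signature_ok)
    return failed / len(requests)


def evaluate_slos(
    requests: List[Request], weekly_auc: float, baseline_auc: float
) -> Dict[str, bool]:
    """Return True per signal when its SLO is currently met."""
    return {
        # SLO: p99 end-to-end latency < 200 ms
        "latency": p99_latency([r.latency_ms for r in requests]) < 200.0,
        # SLO: signature validation failures < 0.1% of requests
        "errors": signature_failure_rate(requests) < 0.001,
        # SLO: weekly ROC-AUC >= baseline minus 0.5 pp (0.005)
        "quality": weekly_auc >= baseline_auc - 0.005,
    }
```

Returning one boolean per signal keeps the lanes separate, so a healthy latency number cannot mask a failing quality check.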

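Error budget = 1 − SLO, where the SLO is expressed as a target fraction of good events. A 99.9% SLO, for example, leaves a 0.1% budget for bad events. When you exhaust it, freeze model promotions.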
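As a rough sketch of the arithmetic, the budget and the promotion freeze might be computed as follows. The function names and the event-count inputs are hypothetical. The sketch assumes a ratio-style SLO (a target fraction of good events) and uses exact fractions so that a budget consumed exactly to zero is detected reliably.

```python
from fractions import Fraction


def error_budget(slo_target: float) -> Fraction:
    """Allowed fraction of bad events: 1 - SLO (e.g. 0.999 -> 1/1000)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Go through str() so 0.999 becomes exactly 999/1000, not its float approximation.
    return 1 - Fraction(str(slo_target))


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> Fraction:
    """Share of the error budget still unspent; zero or negative means exhausted."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = error_budget(slo_target) * total_events
    return 1 - bad_events / allowed_bad


def promotions_frozen(slo_target: float, total_events: int, bad_events: int) -> bool:
    """Freeze model promotions once the error budget is used up."""
    return budget_remaining(slo_target, total_events, bad_events) <= 0
```

With a 99.9% SLO over 1,000,000 requests, the budget allows 1,000 bad requests; `promotions_frozen(0.999, 1_000_000, 1_200)` would therefore report a freeze.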

Analogy

Think of an ICU monitor showing heart rate, blood pressure, oxygen saturation, and temperature. No single reading tells the whole story; together they paint a picture. ML monitoring is the same: single-signal dashboards lie.
