SLI, SLO, error budget
SRE meets ML
Google's SRE book proposes four 'golden signals' for any production service: latency, traffic, errors, and saturation. For ML, add a fifth:
- Quality — model performance against ground truth (delayed but essential).
Per signal, define an SLI (service level indicator: what we measure) and an SLO (service level objective: the target it must meet). Examples:
- SLI: 99th-percentile end-to-end inference latency. SLO: < 200 ms over 30 days.
- SLI: % of requests where signature validation failed. SLO: < 0.1%.
- SLI: weekly ROC-AUC on backfilled labels. SLO: ≥ baseline − 0.5pp.
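The three example SLOs above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, the percentile uses the nearest-rank method (one of several common definitions), and "0.5pp" is assumed to mean 0.005 on a 0-to-1 AUC scale.

```python
def p99_nearest_rank(latencies_ms):
    """99th percentile by the nearest-rank method (an assumed definition)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) in integer arithmetic, to avoid float rounding surprises
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def check_slos(latencies_ms, failed_validations, total_requests,
               weekly_auc, baseline_auc):
    """Return, per signal, whether the SLI currently meets its SLO."""
    return {
        # SLO: p99 end-to-end latency < 200 ms
        "latency": p99_nearest_rank(latencies_ms) < 200.0,
        # SLO: signature-validation failures < 0.1% of requests
        "validation": failed_validations / total_requests < 0.001,
        # SLO: AUC >= baseline - 0.5pp (assumed to be 0.005 on a 0-1 scale)
        "quality": weekly_auc >= baseline_auc - 0.005,
    }


status = check_slos(
    latencies_ms=list(range(1, 101)),  # synthetic samples: 1..100 ms
    failed_validations=5,
    total_requests=10_000,
    weekly_auc=0.90,
    baseline_auc=0.91,
)
```

In this synthetic example the latency and validation SLOs hold, while the quality SLO is violated because the AUC fell a full point below baseline.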
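Error budget = 1 − SLO, with the SLO expressed as the target fraction of good events: a 99.9% SLO leaves a 0.1% budget over the measurement window. When you exhaust it, freeze model promotions.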
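As a rough illustration of the budget arithmetic, here is a sketch that assumes the SLO is a ratio of good events to total events over the window. The function names and the freeze rule (freeze once no budget remains) are hypothetical choices for this example.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = 1.0 - slo_target              # allowed fraction of bad events
    allowed_bad = budget * total_events    # allowed count of bad events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def promotions_frozen(slo_target, good_events, total_events):
    """Assumed policy: freeze model promotions once the budget is exhausted."""
    return error_budget_remaining(slo_target, good_events, total_events) <= 0.0


# 99.9% SLO over 1,000,000 requests allows about 1,000 bad requests.
half_spent = error_budget_remaining(0.999, 999_500, 1_000_000)
frozen = promotions_frozen(0.999, 998_000, 1_000_000)
```

With 500 bad requests roughly half the budget is left; with 2,000 bad requests the budget is overspent and promotions would be frozen under this policy.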