From alert to runbook to fix
Monitoring is only useful if it triggers action
Drift dashboards (Level 5) are decoration unless they connect to a runbook. A mature ML on-call answers, in advance:
- What fires the alert? Performance drop, input drift beyond a threshold, latency/error SLO breach, or a sharp prediction-distribution shift.
- Who is paged? ML platform for serving failures; the model owner for quality degradation. Define it before the incident.
- What are the levers? In order of reversibility: roll back to the previous model version, fall back to a rules baseline, throttle traffic, then retrain.
- Automated vs manual retraining. Scheduled retraining is predictable; triggered retraining (fire when drift crosses a bar) reacts faster but needs a guard so it can't ship a worse model — always re-run the evaluation gate before promotion.
The closing loop: incident → root cause → retrain or roll back → post-mortem → tighten the trigger. Same discipline as SRE, applied to model quality.