Incident Response & Retraining Triggers

The model is degrading at 2am — what's the runbook, and who decides to retrain?


From alert to runbook to fix

Monitoring is only useful if it triggers action

Drift dashboards (Level 5) are decoration unless they connect to a runbook. A mature ML on-call rotation answers these questions in advance:

  • What fires the alert? Performance drop, input drift beyond a threshold, latency/error SLO breach, or a sharp prediction-distribution shift.
  • Who is paged? ML platform for serving failures; the model owner for quality degradation. Define it before the incident.
  • What are the levers? Most reversible first: roll back to the previous model version, fall back to a rules baseline, throttle traffic, then retrain.
  • How is retraining triggered? Scheduled retraining is predictable; triggered retraining (fired when drift crosses a threshold) reacts faster but needs a guard so it can't ship a worse model. Always re-run the evaluation gate before promotion.
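
The first two questions (what fires, who is paged) can be written down as data before any incident happens. The sketch below is one possible shape, not a prescribed implementation: the signal names, threshold values, and team names ("ml-platform", "model-owner") are all hypothetical placeholders to replace with your own.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds; tune these per model and per SLO.
THRESHOLDS: Dict[str, float] = {
    "accuracy_drop": 0.05,     # absolute drop vs. a reference window
    "input_drift": 0.2,        # any drift score where higher means more drift
    "p99_latency_ms": 500.0,   # serving latency SLO
    "error_rate": 0.01,        # serving error SLO
    "prediction_shift": 0.2,   # drift score on the prediction distribution
}

# Serving failures page the platform team; everything else is a
# quality problem and pages the model owner.
SERVING_SIGNALS = {"p99_latency_ms", "error_rate"}


def route_alerts(metrics: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (signal, team_to_page) for every signal over its threshold."""
    alerts = []
    for signal, limit in THRESHOLDS.items():
        value = metrics.get(signal)
        if value is not None and value > limit:
            team = "ml-platform" if signal in SERVING_SIGNALS else "model-owner"
            alerts.append((signal, team))
    return alerts
```

Calling `route_alerts({"accuracy_drop": 0.08, "error_rate": 0.03})` would page the model owner for the accuracy drop and the platform team for the error rate. Keeping the routing in a table like this makes the "who is paged" decision reviewable ahead of time rather than improvised during the incident.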

Closing the loop: incident → root cause → retrain or roll back → post-mortem → tighten the trigger. Same discipline as SRE, applied to model quality.
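
The "retrain or roll back" step of that loop is where the evaluation gate sits. A minimal sketch of drift-triggered retraining with such a guard follows. It assumes you supply two hypothetical callables: `retrain()`, which returns a candidate model, and `evaluate(model)`, which returns a higher-is-better score on a fresh held-out set. The `drift_bar` and `min_gain` parameters are illustrative as well.

```python
from typing import Any, Callable, Tuple


def triggered_retrain(
    drift_score: float,
    drift_bar: float,
    current_model: Any,
    retrain: Callable[[], Any],
    evaluate: Callable[[Any], float],
    min_gain: float = 0.0,
) -> Tuple[Any, str]:
    """Retrain when drift crosses the bar, but promote only through the gate.

    Returns the model that should serve traffic plus a decision label.
    """
    if drift_score <= drift_bar:
        return current_model, "no-trigger"

    candidate = retrain()

    # Evaluation gate: score both models on the same fresh data so the
    # comparison reflects the drifted distribution, not a stale benchmark.
    current_score = evaluate(current_model)
    candidate_score = evaluate(candidate)

    if candidate_score >= current_score + min_gain:
        return candidate, "promoted"

    # The candidate is worse: keep serving the current model and escalate
    # to a human instead of shipping a regression.
    return current_model, "rejected"
```

A "rejected" outcome is itself a signal worth paging on: drift was real, but retraining did not fix it, so a person needs to look at the root cause.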

Analogy

A drift alert with no runbook is a smoke detector wired to nothing. The point isn't the beep — it's that someone knows where the extinguisher is, who calls the fire brigade, and when to evacuate (roll back) versus fight the fire (retrain).

Reflect

Draft a one-page runbook for a model you care about.

  • What single signal would you trust enough to page someone at 2am?
  • What's your fastest reversible lever — rollback, baseline fallback, or throttle?
  • Who owns the decision to retrain, and what gate must the retrain pass?
