Kill-Switches and ML Incidents

Designing a model that you can turn off — and an on-call that knows how.

0/1 done

Designed reversibility

Three layers of off-switch

  1. Feature flag — disable the model code path; fall back to rules / baseline. Fastest, smallest blast radius.
  2. Registry rollback — point the alias at the previous version. Seconds.
  3. Hard cut-off — stop serving entirely; downstream uses a safe default. Last resort.

Runbook minimum

  • Who pages whom.
  • How to revert the alias / flip the flag (exact commands).
  • The safe baseline and its behaviour.
  • The threshold that automatically triggers a freeze on promotions (error budget exhausted).

Test the kill-switch quarterly. An untested switch is worse than no switch — it gives false confidence.

Analogy

Every well-run lab has an emergency stop button on the centrifuge. Nobody intends to press it. Everyone knows where it is. The button is tested. ML deserves the same discipline.

Reflect

Design your kill-switch.

  • What is the *exact command* to revert the production model right now?
  • Who can issue it? At 3 AM on a Sunday?
  • When did you last test it on purpose?

Reading in progress · 0 of 1 activity done