The ML Testing Pyramid

Unit tests for code, data tests for inputs, behavioural tests for the model.


Test the data and the behaviour

Three layers, not one

Ordinary software is covered by unit, integration, and end-to-end tests. ML adds two more things that can break, the data and the model's behaviour, so a robust ML CI pipeline runs three families of tests:

  1. Code tests — the feature transforms, the training loop, the serving handler. Plain pytest.
  2. Data tests — schema, ranges, null rates, category sets, and distribution checks on every incoming batch (Great Expectations, pandera, TFDV). A pipeline that trains on garbage ships garbage.
  3. Behavioural tests — assert properties of the model, not just aggregate accuracy:
    • Invariance: changing an irrelevant feature shouldn't flip the prediction.
    • Directional: raising income, with every other feature held fixed, should not increase the predicted default probability.
    • Minimum-functionality: obvious cases must be correct (a customer of 30 years who has repaid everything in full is not high-risk).
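
The three behavioural properties above can be sketched as plain pytest-style tests. This is a minimal illustration, not a reference implementation: predict_default_probability is a hypothetical stand-in for a real model's predict call, and the feature names, coefficients, and the 0.5 high-risk threshold are all assumptions made for the example.

```python
import math

HIGH_RISK_THRESHOLD = 0.5  # assumed cut-off for "high-risk"


def predict_default_probability(applicant: dict) -> float:
    """Hypothetical toy model; replace with a call to your trained model."""
    score = (
        1.0
        - 0.00004 * applicant["income"]
        - 0.15 * applicant["years_as_customer"]
        + 1.2 * applicant["missed_payments"]
    )
    return 1.0 / (1.0 + math.exp(-score))


# A reference applicant that each test perturbs in exactly one way.
BASE = {
    "income": 40_000,
    "years_as_customer": 3,
    "missed_payments": 1,
    "favourite_colour": "blue",  # assumed to be irrelevant to credit risk
}


def test_invariance_to_irrelevant_feature():
    # Changing a feature that should not matter must not move the prediction.
    changed = {**BASE, "favourite_colour": "green"}
    assert math.isclose(
        predict_default_probability(BASE),
        predict_default_probability(changed),
    )


def test_directional_income():
    # Higher income, everything else fixed, must not raise default probability.
    richer = {**BASE, "income": BASE["income"] + 20_000}
    assert predict_default_probability(richer) <= predict_default_probability(BASE)


def test_minimum_functionality_long_standing_customer():
    # An obvious case: 30 years as a customer, no missed payments.
    loyal = {**BASE, "years_as_customer": 30, "missed_payments": 0}
    assert predict_default_probability(loyal) < HIGH_RISK_THRESHOLD
```

Each test perturbs one thing about a fixed reference applicant, so a failure points at a specific property rather than at a drop in an aggregate metric. With a real model, the exact-equality check in the invariance test may need a tolerance or a "predicted class does not flip" assertion instead.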

These behavioural checks (popularised by the CheckList paper) catch failures that aggregate accuracy hides.
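
Returning to the second layer, the data checks listed there (schema, ranges, null rates, category sets, distribution) can be sketched in a few lines. The version below is hand-rolled in plain Python purely to show the idea; it is not the API of Great Expectations, pandera, or TFDV, and every column name, threshold, and reference value in it is an assumption. The distribution check in particular is a crude mean-shift comparison standing in for a proper statistical test.

```python
from statistics import mean

# Illustrative expectations; all values here are assumptions.
EXPECTED_TYPES = {"income": (int, float), "age": int, "region": str}
ALLOWED_REGIONS = {"north", "south", "east", "west"}
MAX_NULL_RATE = 0.05
REFERENCE_MEAN_INCOME = 42_000.0  # e.g. measured on the training set
MAX_RELATIVE_MEAN_SHIFT = 0.25


def validate_batch(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    problems = []

    for column, expected in EXPECTED_TYPES.items():
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        # Null-rate check.
        null_rate = 1 - len(present) / len(values)
        if null_rate > MAX_NULL_RATE:
            problems.append(f"{column}: null rate {null_rate:.0%} is too high")
        # Schema (type) check on the non-null values.
        if any(not isinstance(v, expected) for v in present):
            problems.append(f"{column}: unexpected value type")

    # Range check.
    ages = [r["age"] for r in rows if isinstance(r.get("age"), int)]
    if any(not 18 <= age <= 120 for age in ages):
        problems.append("age: value outside [18, 120]")

    # Category-set check.
    regions = {r["region"] for r in rows if isinstance(r.get("region"), str)}
    unknown = regions - ALLOWED_REGIONS
    if unknown:
        problems.append(f"region: unknown categories {sorted(unknown)}")

    # Crude distribution check: relative shift of the batch mean.
    incomes = [r["income"] for r in rows if isinstance(r.get("income"), (int, float))]
    if incomes:
        shift = abs(mean(incomes) - REFERENCE_MEAN_INCOME) / REFERENCE_MEAN_INCOME
        if shift > MAX_RELATIVE_MEAN_SHIFT:
            problems.append(f"income: mean shifted by {shift:.0%} from reference")

    return problems
```

In a pipeline, a non-empty result would typically fail the batch before it reaches training, for example with `assert not validate_batch(rows)` inside a test or a gating step. Returning every problem at once, rather than raising on the first, makes a bad batch easier to diagnose.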

Analogy

Aggregate accuracy is a car's average speed — it hides that it stalls on every hill. Behavioural tests are the hill-start, the emergency-brake and the reverse-park checks: specific scenarios the model must never fail, however good the average looks.

Reflect

Design three behavioural tests for one of your models.

  • What invariance must always hold (which feature is irrelevant)?
  • What directional relationship encodes real domain knowledge?
  • Which 'obvious' case would embarrass you if the model got it wrong?
