What do behavioural tests catch that aggregate accuracy does not?

Specific failure modes — e.g. a feature moving the prediction in the wrong direction

The ML Testing Pyramid — Semantic Web Academy

Three layers, not one

Ordinary software has unit, integration, and end-to-end tests. ML adds two axes, data and model behaviour, so a robust ML CI pipeline runs three families of tests:

- Code tests: the feature transforms, the training loop, the serving handler. Plain pytest.
- Data tests: schema, ranges, null rates, category sets, and distribution checks on every incoming batch (Great Expectations, pandera, TFDV). A pipeline that trains on garbage ships garbage.
- Behavioural tests assert properties of the model, not just aggregate accuracy:
  - Invariance: changing an irrelevant feature shouldn't flip the prediction.
  - Directional: raising income should not increase default probability.
  - Minimum-functionality: obvious cases must be correct (a 30-year, fully-paid customer is not high-risk).
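
The three behavioural properties above can be written as ordinary pytest-style functions. The sketch below is illustrative only: `Applicant`, `default_probability`, the feature names, the coefficients, and the 0.2 threshold are all hypothetical stand-ins for a real trained model and its inputs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Applicant:
    income: float            # annual income
    years_as_customer: int
    missed_payments: int
    favourite_colour: str    # deliberately irrelevant feature


def default_probability(a: Applicant) -> float:
    """Toy stand-in for a trained model: returns P(default) clamped to [0, 1]."""
    score = (
        0.5
        - 0.000004 * a.income
        - 0.01 * a.years_as_customer
        + 0.15 * a.missed_payments
    )
    return min(1.0, max(0.0, score))


BASE = Applicant(income=40_000, years_as_customer=5,
                 missed_payments=1, favourite_colour="blue")


def test_invariance_to_irrelevant_feature():
    # Invariance: changing an irrelevant feature must not change the output.
    changed = replace(BASE, favourite_colour="red")
    assert default_probability(changed) == default_probability(BASE)


def test_directional_income():
    # Directional: raising income must not raise the default probability.
    richer = replace(BASE, income=BASE.income * 1.5)
    assert default_probability(richer) <= default_probability(BASE)


def test_minimum_functionality_obvious_case():
    # Minimum functionality: an obviously safe customer is not high-risk.
    obvious = Applicant(income=60_000, years_as_customer=30,
                        missed_payments=0, favourite_colour="green")
    assert default_probability(obvious) < 0.2
```

With a real model, `default_probability` would wrap the model's predict call, and each test would typically loop over many perturbed examples rather than a single base case.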

These behavioural checks (popularised by the CheckList paper, Ribeiro et al., 2020) catch specific failures that aggregate accuracy hides, such as a feature moving the prediction in the wrong direction.
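
The data-test family can be sketched in a similar way. The example below is a hand-rolled validator in plain Python that stands in for what a tool such as Great Expectations, pandera, or TFDV would do; it does not use any of those libraries' APIs. The column names, allowed categories, and null-rate threshold are assumptions chosen for illustration, and distribution checks are left out to keep it short.

```python
from typing import Any

# Hypothetical expectations for a batch of loan applications.
REQUIRED_COLUMNS = {"income": float, "years_as_customer": int, "segment": str}
ALLOWED_SEGMENTS = {"retail", "business"}
MAX_NULL_RATE = 0.05


def validate_batch(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    errors: list[str] = []

    # Schema and null-rate checks, column by column.
    for column, expected_type in REQUIRED_COLUMNS.items():
        values = [row.get(column) for row in rows]
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values)
        if null_rate > MAX_NULL_RATE:
            errors.append(
                f"{column}: null rate {null_rate:.0%} exceeds {MAX_NULL_RATE:.0%}"
            )
        if any(not isinstance(v, expected_type) for v in present):
            errors.append(f"{column}: expected {expected_type.__name__}")

    # Range check: income must be non-negative.
    incomes = [r["income"] for r in rows if isinstance(r.get("income"), float)]
    if any(v < 0 for v in incomes):
        errors.append("income: negative value")

    # Category-set check: no unseen segments.
    segments = {r["segment"] for r in rows if isinstance(r.get("segment"), str)}
    unknown = segments - ALLOWED_SEGMENTS
    if unknown:
        errors.append(f"segment: unknown categories {sorted(unknown)}")

    return errors
```

In CI, a non-empty result would fail the job before training starts, which is the point of the "garbage in, garbage out" warning above.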
