Test the data and the behaviour
Three layers, not one
Ordinary software has unit, integration, and end-to-end tests. ML adds two axes, data and model behaviour, so a robust ML CI pipeline runs three families of tests:
- Code tests — the feature transforms, the training loop, the serving handler. Plain pytest.
- Data tests — schema, ranges, null rates, category sets, and distribution checks on every incoming batch (Great Expectations, pandera, TFDV). A pipeline that trains on garbage ships garbage.
- Behavioural tests — assert properties of the model, not just aggregate accuracy:
  - Invariance: changing an irrelevant feature shouldn't flip the prediction.
  - Directional: raising income, with every other feature held fixed, should not increase the predicted default probability.
  - Minimum-functionality: obvious cases must be correct (a 30-year, fully-paid customer is not high-risk).
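As a rough illustration, the three behavioural properties can be written as ordinary pytest-style functions. This is a minimal sketch, not a reference implementation: `predict_default_proba` is a hypothetical hand-written scorer standing in for a trained model, and the feature names (`income`, `years_as_customer`, `missed_payments`, `favourite_colour`), the coefficients, and the 0.5 decision threshold are all assumptions made for the example.

```python
import math

def predict_default_proba(applicant: dict) -> float:
    """Hypothetical stand-in for a trained model (hand-picked logistic score)."""
    z = (
        1.5
        - 0.00004 * applicant["income"]
        - 0.08 * applicant["years_as_customer"]
        + 2.0 * applicant["missed_payments"]
    )
    return 1.0 / (1.0 + math.exp(-z))

# Assumed baseline applicant; "favourite_colour" plays the irrelevant feature.
BASE = {
    "income": 40_000,
    "years_as_customer": 3,
    "missed_payments": 1,
    "favourite_colour": "blue",
}
THRESHOLD = 0.5  # assumed cut-off for labelling someone high-risk

def test_invariance():
    # Changing an irrelevant feature must not flip the predicted class.
    changed = {**BASE, "favourite_colour": "green"}
    before = predict_default_proba(BASE) >= THRESHOLD
    after = predict_default_proba(changed) >= THRESHOLD
    assert before == after

def test_directional():
    # Raising income, everything else fixed, must not raise default probability.
    richer = {**BASE, "income": BASE["income"] + 10_000}
    assert predict_default_proba(richer) <= predict_default_proba(BASE)

def test_minimum_functionality():
    # An obvious case: a 30-year customer with no missed payments is not high-risk.
    loyal = {
        "income": 60_000,
        "years_as_customer": 30,
        "missed_payments": 0,
        "favourite_colour": "red",
    }
    assert predict_default_proba(loyal) < THRESHOLD
```

In a real suite the scorer would be replaced by the loaded model artefact, and each test would typically run over many perturbed examples rather than a single baseline row.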
These behavioural checks (popularised by the CheckList paper) catch failures that aggregate accuracy hides.
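The data-test layer from the list above can be sketched in a similar way. The example below hand-rolls the checks in plain Python so that it runs without dependencies; in practice a library such as Great Expectations, pandera, or TFDV would take this role. The column names, allowed regions, thresholds, and reference mean are all hypothetical, and the mean-shift test is only a crude proxy for a proper distribution check.

```python
from statistics import mean

# All constants below are assumptions chosen for illustration.
REQUIRED_COLUMNS = {"age", "income", "region"}
ALLOWED_REGIONS = {"north", "south", "east", "west"}
MAX_NULL_RATE = 0.05             # tolerated fraction of missing incomes
REFERENCE_MEAN_INCOME = 50_000.0  # e.g. taken from the training set
MAX_MEAN_SHIFT = 0.25            # tolerated relative drift of the mean

def validate_batch(rows: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    errors = []
    for i, row in enumerate(rows):
        # Schema: every required column must be present.
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        # Range check on a numeric column.
        age = row.get("age")
        if age is not None and not 18 <= age <= 120:
            errors.append(f"row {i}: age {age} outside [18, 120]")
        # Category-set check.
        region = row.get("region")
        if region is not None and region not in ALLOWED_REGIONS:
            errors.append(f"row {i}: unknown region {region!r}")
    # Null-rate check over the whole batch.
    incomes = [row.get("income") for row in rows]
    null_rate = incomes.count(None) / len(rows)
    if null_rate > MAX_NULL_RATE:
        errors.append(f"income null rate {null_rate:.1%} exceeds {MAX_NULL_RATE:.0%}")
    # Crude distribution check: relative shift of the mean against a reference.
    observed = [x for x in incomes if x is not None]
    if observed:
        shift = abs(mean(observed) - REFERENCE_MEAN_INCOME) / REFERENCE_MEAN_INCOME
        if shift > MAX_MEAN_SHIFT:
            errors.append(f"mean income shifted by {shift:.0%} from reference")
    return errors
```

A CI job could call `validate_batch` on each incoming batch and fail the run whenever the returned list is non-empty, so that training never starts on data that violates the contract.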