Code + Data + Model
Three test layers
- Code tests: pure-Python unit tests on feature functions and pre-processing.
- Data tests: schema, null rates, value ranges; tools: Great Expectations, Soda, dbt tests.
- Model tests: behavioural (does the model respond as expected to a sentinel input?), invariance (paraphrasing a sentence should not flip the predicted sentiment), directional (raising income should raise the predicted credit limit).
Inspired by 'Beyond Accuracy: Behavioral Testing of NLP Models with CheckList' (Ribeiro et al., 2020).
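The three kinds of model test can be sketched as plain pytest-style functions. This is a minimal illustration, not a reference implementation: `predict_sentiment` and `predict_credit_limit` are hypothetical toy stand-ins for a saved model's prediction API, and the example sentences and numbers are made up for the sketch.

```python
import re

# Hypothetical stand-in for a trained sentiment model: a tiny lexicon counter.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def predict_sentiment(text: str) -> str:
    """Toy classifier: compare counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def predict_credit_limit(income: float, debt: float) -> float:
    """Toy regressor (assumed formula): limit grows with income, shrinks with debt."""
    return max(0.0, 0.2 * income - 0.5 * debt)


def test_sentinel_input():
    # Behavioural: an unambiguous input must get the obvious label.
    assert predict_sentiment("This is terrible") == "negative"


def test_invariance_to_paraphrase():
    # Invariance: a paraphrase must not change the predicted label.
    original = "The service was great"
    paraphrase = "Great service"
    assert predict_sentiment(original) == predict_sentiment(paraphrase)


def test_directional_income():
    # Directional: raising income, all else equal, must raise the limit.
    low = predict_credit_limit(income=40_000, debt=5_000)
    high = predict_credit_limit(income=60_000, debt=5_000)
    assert high > low
```

In a real pipeline the two toy functions would be replaced by calls to the saved model, and the test inputs would come from a curated set of sentinel cases rather than inline literals.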
A pragmatic PR pipeline
lint → unit tests → data tests (sample) → train (smoke)
→ behavioural tests on saved model
→ metric gate (fail on a regression of more than 0.5% against the baseline)