Why touch the held-out test set only once, at the very end?

Repeated tuning against it leaks information and inflates scores that won't hold in production

Experiment Design & Statistical Discipline — Semantic Web Academy

Make experiments answer questions, not generate numbers

Scaling experimentation isn't only GPUs and trackers — it's discipline that stops you fooling yourself:

Always have a baseline. A dumb majority-class or last-value model. If you can't beat it convincingly, stop.
Hold out a true test set touched once, at the end. Tuning against the test set is how 0.94 becomes 0.87 in production.
Use cross-validation for the variance estimate; a single split's 0.3% lift is often inside the noise band.
Fix and log seeds so a 'win' is reproducible, not luck.
One hypothesis per run. Change the model or the features, not both, or you can't attribute the change.

The cardinal sin is leakage — letting test information seep into training (fitting a scaler on the full dataset, using future data, target encoding before the split). Leakage produces spectacular offline scores that evaporate on real traffic.

Experiment Design & Statistical Discipline

Don't fool yourself

Make experiments answer questions, not generate numbers

Analogy

Reflect