Don't fool yourself
Make experiments answer questions, not generate numbers
Scaling experimentation isn't only GPUs and trackers — it's discipline that stops you fooling yourself:
- Always have a baseline. A dumb majority-class or last-value model. If you can't beat it convincingly, stop.
- Hold out a true test set touched once, at the end. Tuning against the test set is how 0.94 becomes 0.87 in production.
- Use cross-validation for the variance estimate; a single split's 0.3% lift is often inside the noise band.
- Fix and log seeds so a 'win' is reproducible, not luck.
- One hypothesis per run. Change the model or the features, not both, or you can't attribute the change.
The cardinal sin is leakage — letting test information seep into training (fitting a scaler on the full dataset, using future data, target encoding before the split). Leakage produces spectacular offline scores that evaporate on real traffic.