The 0.94 → 0.87 mystery
A familiar story
On Tuesday your gradient-boosted classifier hits 0.94 ROC-AUC on the holdout. You ship the notebook to Slack, screenshot the metric, and celebrate.
Six weeks later compliance asks you to retrain on a refreshed snapshot. You re-run the notebook. AUC is 0.87.
Possible culprits — pick the right one in under a minute:
- pandas upgraded from 2.1 → 2.2 and changed how NaN flows through groupby.
- The random seed for the train/val split was never set.
- Your feature_engineering_v3.ipynb was renamed to _final_FINAL.
- The CSV in S3 has 11 more columns than last time and you silently dropped them.
- A teammate fixed a leakage bug after the original run.
Without a recorded experiment, you can't tell which of these changed. Welcome to the reproducibility crisis, one of the most common reasons ML projects die between proof-of-concept and production.
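To make "a recorded experiment" concrete, here is a minimal sketch of a run manifest that captures several of the culprits listed above: library versions, the random seed, a hash of the data snapshot, and its column list. It uses only the Python standard library and is not tied to any particular tracking tool. The function names (`build_manifest`, `diff_manifests`, and the helpers) and the manifest layout are assumptions made for illustration.

```python
import csv
import hashlib
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def file_sha256(path):
    """Hash the file in 1 MiB chunks so large snapshots need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def csv_columns(path):
    """Read only the header row of a CSV file."""
    with open(path, newline="") as fh:
        return next(csv.reader(fh), [])


def build_manifest(data_path, seed, metrics, packages=("pandas", "scikit-learn")):
    """Collect the facts you would want six weeks later (hypothetical layout)."""
    return {
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in packages},
        "seed": seed,
        "data_sha256": file_sha256(data_path),
        "data_columns": csv_columns(data_path),
        "metrics": metrics,
    }


def write_manifest(manifest, out_path):
    """Store the manifest as JSON next to the model artifacts."""
    with open(out_path, "w") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)


def diff_manifests(old, new):
    """Return the top-level keys whose values differ between two runs."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

With one manifest saved per run, comparing the 0.94 run against the 0.87 run with `diff_manifests` would show at a glance whether the packages, the seed, or the data snapshot changed. It would not catch everything on the list: a renamed notebook or a teammate's leakage fix would additionally require recording something like a code revision identifier, which this sketch leaves out.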