The Reproducibility Crisis

Why a model that scored 0.94 on Tuesday is 0.87 in November.

The 0.94 → 0.87 mystery

A familiar story

On Tuesday your gradient-boosted classifier hits 0.94 ROC-AUC on the holdout. You ship the notebook to Slack, screenshot the metric, and celebrate.

Six weeks later compliance asks you to retrain on a refreshed snapshot. You re-run the notebook. AUC is 0.87.

Possible culprits — pick the right one in under a minute:

  • pandas was upgraded from 2.1 to 2.2, which changed how NaN flows through groupby.
  • The random seed for the train/val split was never set.
  • Your feature_engineering_v3.ipynb was renamed to _final_FINAL.
  • The CSV in S3 has 11 more columns than last time and you silently dropped them.
  • A teammate fixed a leakage bug after the original run.

Without a recorded experiment, you can't tell which one it was. Welcome to the reproducibility crisis, one of the most common reasons ML projects die between proof-of-concept and production.
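
One way to make the culprit list above answerable in under a minute is to save a small manifest next to every metric you report. The sketch below uses only the Python standard library. The function names (`build_manifest`, `diff_manifests`, `save_manifest`) and the default package list are illustrative assumptions, not part of any particular tracking tool.

```python
import hashlib
import json
import platform
from importlib import metadata


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large snapshot need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names):
    """Installed version of each package, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(data_path, columns, seed, params,
                   packages=("pandas", "scikit-learn")):
    """Collect the facts most likely to explain a changed metric."""
    return {
        "python": platform.python_version(),
        "packages": package_versions(packages),  # assumed list; adjust to your stack
        "data_sha256": file_sha256(data_path),
        "columns": sorted(columns),
        "seed": seed,
        "params": params,
    }


def save_manifest(manifest, path):
    """Write the manifest as stable, human-readable JSON."""
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)


def diff_manifests(old, new):
    """Return {key: (old_value, new_value)} for each top-level key that differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

Save one manifest per run with `save_manifest`. When Tuesday's 0.94 turns into 0.87, `diff_manifests(old, new)` returns only the keys that changed: `packages` after a pandas upgrade, for example, or `columns` and `data_sha256` after a refreshed snapshot. It will not catch everything (a renamed notebook or a teammate's bug fix also needs a recorded code revision), so treat it as a starting point.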

Lab notebook vs scratchpad

Think of a chemistry lab notebook. Every flask, every temperature, every timestamp goes on paper as it happens. Fifty years later another scientist can re-run the experiment exactly.

A Jupyter notebook is the opposite: cells run out of order, variables are mutated in place, the kernel is restarted and nobody knows which df is in memory. You aren't keeping a lab notebook — you're keeping a diary of your moods.
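
The in-place mutation problem is easy to reproduce outside a notebook. In this minimal sketch a plain dict stands in for the kernel's memory and `scale_cell` plays the role of a notebook cell. Both names, and the sample values, are made up for illustration.

```python
# A dict stands in for the notebook kernel's memory (illustrative only).
state = {"income": [52000.0, 61000.0]}


def scale_cell(state):
    """Hypothetical notebook cell: rescales the feature in place."""
    state["income"] = [v / 1000 for v in state["income"]]


def scaled(income):
    """Pure alternative: returns a new list and leaves its input untouched."""
    return [v / 1000 for v in income]


raw = list(state["income"])

scale_cell(state)
after_one_run = list(state["income"])   # [52.0, 61.0]

scale_cell(state)                       # the cell is accidentally run again
after_two_runs = list(state["income"])  # [0.052, 0.061]

# The pure version gives the same answer no matter how often it is called.
assert scaled(raw) == scaled(raw) == after_one_run
```

Nothing in `state` records how many times the cell ran, which is exactly why "which df is in memory" has no answer after the fact.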

Personal audit

Audit your last three ML projects honestly.

  • How many can you re-run end-to-end *today* and reproduce the headline metric?
  • Where does the original code live? The data? The hyper-parameters?
  • What was the first artefact you lost after shipping?
