Your model dropped 7 AUC points after a 'simple' retrain. What is the single most useful thing to have logged?

Exact code commit + data hash + hyper-parameters of the original run

A screenshot of the original metric

Exact code commit + data hash + hyper-parameters of the original run

The Reproducibility Crisis — Semantic Web Academy

The 0.94 → 0.87 mystery

A familiar story

On Tuesday your gradient-boosted classifier hits 0.94 ROC-AUC on the holdout. You ship the notebook to Slack, screenshot the metric, and celebrate.

Six weeks later compliance asks you to retrain on a refreshed snapshot. You re-run the notebook. AUC is 0.87.

Possible culprits — pick the right one in under a minute:

pandas upgraded from 2.1 → 2.2 and changed how NaN flows through groupby.
The random seed for the train/val split was never set.
Your feature_engineering_v3.ipynb was renamed to _final_FINAL.
The CSV in S3 has 11 more columns than last time and you silently dropped them.
A teammate fixed a leakage bug after the original run.

Without a recorded experiment, you can't tell. Welcome to the reproducibility crisis — the single biggest reason ML projects die between proof-of-concept and production.

Lab notebook vs scratchpad

Think of a chemistry lab notebook. Every flask, every temperature, every timestamp goes on paper as it happens. Fifty years later another scientist can re-run the experiment exactly.

A Jupyter notebook is the opposite: cells run out of order, variables are mutated in place, the kernel is restarted and nobody knows which df is in memory. You aren't keeping a lab notebook — you're keeping a diary of your moods.

Personal audit

Audit your last three ML projects honestly.

›How many can you re-run end-to-end *today* and reproduce the headline metric?
›Where does the original code live? The data? The hyper-parameters?
›What was the first artefact you lost after shipping?

Reading in progress · 0 of 1 activity done