Pointers, not blobs
Why Git alone is not enough
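Git is fantastic for small text files. It is terrible for a 20 GB Parquet partition: Git keeps every version of every file in history, and large binary files diff and compress poorly, so the repository grows with each change. Two tools fill the gap: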
- DVC — pointer files in Git, blobs in S3/GCS/Azure/SSH. Good for ML training datasets.
- lakeFS — Git-like branching for entire data lakes. Good when many teams share the same lake.
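The pointer idea can be sketched in a few lines of Python. This is only an illustration of the concept, not DVC's or lakeFS's actual file format or API: the `store_blob` function, the `.ptr.json` layout, and the local `blob_store` directory standing in for remote object storage are all hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large file is never loaded fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_blob(data_file: Path, blob_store: Path) -> Path:
    """Copy data_file into a content-addressed store and write a small pointer.

    Hypothetical layout: the blob is saved under its own hash, and a JSON
    pointer is written next to the original file. Only the pointer would be
    committed to Git; the blob itself stays out of the repository.
    """
    digest = sha256_of(data_file)
    blob_store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, blob_store / digest)

    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(
        json.dumps(
            {
                "sha256": digest,
                "size": data_file.stat().st_size,
                "path": data_file.name,
            },
            indent=2,
        )
    )
    return pointer
```

The pointer stays a few hundred bytes no matter how large the data file is, which is what keeps the Git history small.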
Whichever you pick, the invariant is the same:
Every model run records the exact dataset hash it was trained on.
Without that, 'reproducible' is a lie.
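A minimal sketch of that invariant, assuming the dataset is a local directory of files and each run has its own output directory. The names `dataset_hash`, `record_run`, `verify_run`, and the `run.json` file are hypothetical and not part of DVC or lakeFS; with either tool you would more likely record the hash or commit ID the tool already tracks.

```python
import hashlib
import json
from pathlib import Path


def dataset_hash(root: Path) -> str:
    """Return one SHA-256 covering every file under root.

    Files are visited in sorted relative-path order so the result does not
    depend on filesystem listing order. Paths are included so that renaming
    a file changes the hash as well.
    """
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    digest = hashlib.sha256()
    for rel_path, file in files:
        file_digest = hashlib.sha256()
        with file.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                file_digest.update(chunk)
        digest.update(f"{rel_path}\0{file_digest.hexdigest()}\n".encode("utf-8"))
    return digest.hexdigest()


def record_run(run_dir: Path, dataset_dir: Path, params: dict) -> Path:
    """Write run.json containing the dataset hash next to the run's outputs."""
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"dataset_sha256": dataset_hash(dataset_dir), "params": params}
    out = run_dir / "run.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out


def verify_run(run_dir: Path, dataset_dir: Path) -> bool:
    """Check that dataset_dir still matches the hash recorded for the run."""
    record = json.loads((run_dir / "run.json").read_text())
    return record["dataset_sha256"] == dataset_hash(dataset_dir)
```

Calling `verify_run` before retraining or debugging a model catches the case where the data changed silently after the original run.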