Data Versioning with DVC and lakeFS

Code in Git, data in… the right tool for the right artefact.

Pointers, not blobs

Why Git alone is not enough

Git is fantastic for small text files. It is terrible for a 20 GB Parquet partition. Two tools fill the gap:

  • DVC — pointer files in Git, blobs in S3/GCS/Azure/SSH. Good for ML training datasets.
  • lakeFS — Git-like branching for entire data lakes. Good when many teams share the same lake.
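To make "pointer files in Git, blobs elsewhere" concrete, here is a minimal sketch of the idea in plain Python. It is not DVC's actual file format or CLI: the function names, the `.pointer.json` suffix, and the local `store_dir` standing in for remote storage are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large file never sits fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_store(data_file: Path, store_dir: Path) -> Path:
    """Copy the blob into a content-addressed store and write a small pointer file.

    Hypothetical helper: store_dir stands in for S3/GCS/etc., and the
    pointer layout is invented for this sketch.
    """
    digest = file_sha256(data_file)
    store_dir.mkdir(parents=True, exist_ok=True)
    blob_path = store_dir / digest
    if not blob_path.exists():  # identical content is stored only once
        shutil.copyfile(data_file, blob_path)

    pointer = {
        "path": data_file.name,
        "sha256": digest,
        "size": data_file.stat().st_size,
    }
    pointer_path = data_file.with_name(data_file.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path
```

Only the pointer file, a few hundred bytes of text, would be committed to Git. The blob itself lives in the store, keyed by its hash, so the commit stays small no matter how large the dataset grows.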

Whichever you pick, the invariant is the same:

Every model run records the exact dataset hash it was trained on.

Without that, 'reproducible' is a lie.
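One way to enforce that invariant, sketched here with only the Python standard library: hash the dataset before training and write the hash into a run record stored next to the model. The record layout and the names `write_run_record` and `verify_run_record` are hypothetical. With DVC or lakeFS you would more likely record the hash or commit ID the tool already tracks.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def dataset_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the dataset file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_record(dataset: Path, params: dict, out_path: Path) -> dict:
    """Write a JSON record tying this run to the exact dataset bytes (hypothetical layout)."""
    record = {
        "dataset_path": str(dataset),
        "dataset_sha256": dataset_hash(dataset),
        "params": params,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


def verify_run_record(record: dict, dataset: Path) -> bool:
    """Return True only if the dataset on disk still matches the recorded hash."""
    return record["dataset_sha256"] == dataset_hash(dataset)
```

Calling `verify_run_record` before re-running an experiment catches the case where the file at the same path has silently changed, which is exactly the situation that makes a run unreproducible.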

Analogy

Git versions the menu; DVC and lakeFS version the pantry. You wouldn't squeeze 50 kilos of tomatoes onto a recipe card, and you wouldn't squeeze a 20 GB Parquet file into a Git commit. Different objects, different stores, linked by a label.
