Overview
Lake vs Warehouse vs Lakehouse
Schema-on-read vs schema-on-write storage — and the lakehouse compromise.
Why it matters
The lake holds raw / semi / unstructured data cheaply; the warehouse holds curated, schema-strict tables; the lakehouse (Delta / Iceberg / Hudi) puts ACID + schema on top of lake storage.
Going deeper
What an open table format actually buys you, line by line:
- Atomic commits — a write either lands entirely or not at all; readers never see a half-written table. (Parquet alone can't promise this.)
- Time travel — query the table as of yesterday at 14:00. Cheap rollback, reproducible ML training cuts, regulator-friendly audit.
- Schema evolution — add / rename / reorder columns without rewriting every file; old readers stay valid.
- Hidden partitioning + pruning — scanners skip files via metadata, not directory layout. Means partition strategy can change without breaking queries.
- Engine pluggability — Spark, Trino, Flink, DuckDB, Snowflake (for Iceberg) can all read the same table. The format outlives any one engine.