Theory
Ending the lake-vs-warehouse split
For a decade teams ran two systems: a cheap, open data lake (Parquet on S3) for data science, and an expensive, reliable data warehouse for BI — copying data between them and arguing over which was the truth. Databricks' Lakehouse thesis: add warehouse guarantees directly on top of the open lake files, so one copy of the data serves both.
- Open storage — Your data stays as Parquet files in your object storage (S3/ADLS/GCS). You are not locked into a proprietary store; other engines can read the same files.
- Delta Lake — The layer that makes those files trustworthy: a transaction log (_delta_log) over the Parquet gives you ACID transactions, schema enforcement, and time travel — the things a raw lake never had. (Covered in depth next lesson.)
- Unified compute — The same platform runs Spark for data-engineering and ML, and fast SQL (via SQL Warehouses + the Photon engine) for BI. One governance model, one copy of data, both audiences.
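To make the transaction-log idea concrete, here is a toy sketch in plain Python of how a log over immutable data files can give atomic commits and time travel. It is a simplification, not the Delta Lake implementation: real commits also carry metadata, protocol, and statistics actions, and real tables use checkpoints. The helper names (commit, live_files) and the file names are hypothetical; only the _delta_log directory, the zero-padded version file names, and the add/remove action keys mirror the actual log layout.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(table: Path, version: int, actions: list) -> None:
    """Write one commit: newline-delimited JSON actions in a file named by version."""
    log_dir = table / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    commit_file = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file already exists, so two writers cannot
    # both claim the same version (a stand-in for atomic put-if-absent).
    with open(commit_file, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(table: Path, as_of: Optional[int] = None) -> set:
    """Replay commits in version order; return the data files in that snapshot."""
    files = set()
    for commit_file in sorted((table / "_delta_log").glob("*.json")):
        version = int(commit_file.stem)
        if as_of is not None and version > as_of:
            break  # time travel: ignore commits newer than the requested version
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "transactions"  # hypothetical table directory
    commit(table, 0, [{"add": {"path": "part-000.parquet"}}])
    commit(table, 1, [{"add": {"path": "part-001.parquet"}}])
    # A rewrite swaps files in a single commit: readers see all of it or none.
    commit(table, 2, [
        {"remove": {"path": "part-000.parquet"}},
        {"add": {"path": "part-002.parquet"}},
    ])
    latest = live_files(table)
    as_of_v1 = live_files(table, as_of=1)

print(sorted(latest))    # ['part-001.parquet', 'part-002.parquet']
print(sorted(as_of_v1))  # ['part-000.parquet', 'part-001.parquet']
```

The Parquet files themselves are never modified in this sketch. A reader's view of the table is whatever set of files the log says is live at a given version, which is why older versions remain queryable as long as their files are kept.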
Use Case Example: A bank kept a Parquet lake for its ML team and a separate warehouse for finance dashboards — two pipelines, two copies, constant drift. On the Lakehouse, the curated tables live once as Delta tables in the lake; data scientists hit them with Spark/Python and analysts hit the same tables with SQL — no copy, no drift, one set of permissions.
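The two access paths in the bank example might look roughly like the sketch below. It assumes an existing SparkSession passed in as spark (Databricks notebooks provide one under that name); the table name finance.curated.transactions and the columns account_id and amount are hypothetical, chosen only for illustration.

```python
# Hypothetical curated Delta table shared by both teams.
TABLE = "finance.curated.transactions"

# BI path: the kind of statement an analyst could run against the table.
DASHBOARD_SQL = (
    "SELECT account_id, SUM(amount) AS total_amount "
    f"FROM {TABLE} GROUP BY account_id"
)


def features_for_ml(spark, table_name=TABLE):
    """Data-science path: load the table as a DataFrame and aggregate in Python."""
    df = spark.read.table(table_name)
    return df.groupBy("account_id").agg({"amount": "sum"})


def dashboard_totals(spark, query=DASHBOARD_SQL):
    """SQL path: run the analyst's query against the same table."""
    return spark.sql(query)
```

Both functions resolve the same table name, so there is no second copy to keep in sync, and permissions granted on that one table apply to either path.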