Databricks — The Lakehouse Idea

One system for BI and ML on open files: warehouse reliability over a data lake.


Theory

Ending the lake-vs-warehouse split

For a decade teams ran two systems: a cheap, open data lake (Parquet on S3) for data science, and an expensive, reliable data warehouse for BI — copying data between them and arguing over which was the truth. Databricks' Lakehouse thesis: add warehouse guarantees directly on top of the open lake files, so one copy of the data serves both.

  • Open storage: Your data stays as Parquet files in your own object storage (S3/ADLS/GCS). You are not locked into a proprietary store; other engines can read the same files.
  • Delta Lake: The layer that makes those files trustworthy. A transaction log (_delta_log) stored alongside the Parquet files gives you ACID transactions, schema enforcement, and time travel, the things a raw lake never had. (Covered in depth in the next lesson.)
  • Unified compute: The same platform runs Spark for data engineering and ML, and fast SQL (via SQL Warehouses and the Photon engine) for BI. One governance model, one copy of data, both audiences.
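
To make the transaction-log idea concrete, here is a minimal sketch in plain Python (no Spark or Delta library involved). It assumes a heavily simplified log: each commit is a zero-padded JSON file under _delta_log holding only "add" and "remove" actions. Real Delta logs also carry metadata, protocol, and checkpoint entries, which this toy ignores. The helper names commit and live_files, the accounts table, and the part-file names are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    """Write one commit as newline-delimited JSON, named by its zero-padded version."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions))


def live_files(log_dir: Path, as_of: int | None = None) -> set[str]:
    """Replay commits in version order (up to as_of) and return that snapshot's data files."""
    files: set[str] = set()
    for entry in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(entry.stem) > as_of:
            break
        for line in entry.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Hypothetical table directory; in a real lake this would sit in object storage.
table = Path(tempfile.mkdtemp()) / "accounts"
log_dir = table / "_delta_log"
log_dir.mkdir(parents=True)

commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
# Version 2 rewrites part-0000 (think of an UPDATE): the old file is removed
# and its replacement added in the same commit, so readers see both or neither.
commit(log_dir, 2, [
    {"remove": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0002.parquet"}},
])

latest = live_files(log_dir)        # current snapshot
v1 = live_files(log_dir, as_of=1)   # "time travel" to version 1

print(sorted(latest))  # ['part-0001.parquet', 'part-0002.parquet']
print(sorted(v1))      # ['part-0000.parquet', 'part-0001.parquet']
```

The point of the sketch is that the Parquet files never change in place. A reader decides which files belong to the table by replaying the log, which is why a multi-file change can appear all at once and why an older version can still be reconstructed.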

Use Case Example: A bank kept a Parquet lake for its ML team and a separate warehouse for finance dashboards — two pipelines, two copies, constant drift. On the Lakehouse, the curated tables live once as Delta tables in the lake; data scientists hit them with Spark/Python and analysts hit the same tables with SQL — no copy, no drift, one set of permissions.

Analogy

The old world is renting two premises: a cheap garage (the lake) to tinker with raw parts, and a pristine showroom (the warehouse) to present finished cars, hauling cars back and forth and scratching them in transit. The Lakehouse is one building with both a workshop and a showroom floor: the same cars, worked on and displayed in place, under one set of keys. You stop paying two rents, and you stop scratching cars in the move (the drift between copies of the data).

One lake, two workloads


Lakehouse: one storage layer, two audiences

Open Parquet + a Delta transaction log = warehouse guarantees on lake economics, read by both Spark/ML and SQL/BI.

Reflect

The Lakehouse pitch is really about removing a copy. Every copy of data is a future inconsistency, a doubled bill, and a second permission model. When you evaluate the Lakehouse, count the copies it would delete from your current architecture.

  • How many copies of your 'source of truth' tables exist across lake and warehouse today?
  • Which team would resist collapsing them into one — and is the reason technical or political?
