Lake vs Warehouse vs Lakehouse

Schema-on-read vs schema-on-write storage — and the lakehouse compromise.

Why it matters

The lake holds raw, semi-structured, and unstructured data cheaply; the warehouse holds curated, schema-strict tables; the lakehouse puts ACID transactions and schema enforcement on top of lake storage through an open table format (Delta Lake, Apache Iceberg, Apache Hudi).
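
To make the schema-on-read vs schema-on-write distinction concrete, here is a minimal sketch in plain Python. Everything in it (the `SCHEMA` dict, the function names, the in-memory "lake" and "warehouse" lists) is hypothetical and invented for illustration; no real storage engine works exactly like this.

```python
import json

# Hypothetical schema for illustration: column name -> required Python type.
SCHEMA = {"user_id": int, "amount": float}

def write_to_lake(store: list[str], record: dict) -> None:
    """Schema-on-read: accept any JSON-serialisable record as-is."""
    store.append(json.dumps(record))

def write_to_warehouse(table: list[dict], record: dict) -> None:
    """Schema-on-write: reject records that do not match SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"columns {sorted(record)} do not match {sorted(SCHEMA)}")
    for column, expected in SCHEMA.items():
        if not isinstance(record[column], expected):
            raise TypeError(f"{column} must be {expected.__name__}")
    table.append(record)

def read_from_lake(store: list[str]) -> list[dict]:
    """The schema is applied only at read time; rows that do not fit are skipped."""
    rows = []
    for raw in store:
        record = json.loads(raw)
        try:
            rows.append({"user_id": int(record["user_id"]),
                         "amount": float(record["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # the row was cheap to store but is unusable for this query
    return rows

lake: list[str] = []
warehouse: list[dict] = []
good = {"user_id": 1, "amount": 9.5}
bad = {"user": "alice", "total": "n/a"}

write_to_lake(lake, good)
write_to_lake(lake, bad)             # accepted without complaint
write_to_warehouse(warehouse, good)
try:
    write_to_warehouse(warehouse, bad)
    rejected = False
except ValueError:
    rejected = True                  # the warehouse refuses the malformed record
```

The lake pushes the cost of a bad record onto every reader; the warehouse pushes it onto the writer, once. A lakehouse aims to keep the lake's storage while moving that check back to write time.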

Going deeper

What an open table format actually buys you, point by point:

  • Atomic commits: a write either lands entirely or not at all, so readers never see a half-written table. (A directory of Parquet files alone can't promise this.)
  • Time travel: query the table as of yesterday at 14:00, provided that snapshot is still within the retention window. Cheap rollback, reproducible ML training datasets, regulator-friendly audit.
  • Schema evolution: add, rename, or reorder columns as a metadata change, without rewriting every file; existing files stay readable. Exact support varies by format; Delta Lake, for example, needs column mapping enabled before a column can be renamed.
  • Hidden partitioning and pruning: scanners skip files using per-file metadata, not directory layout. In Iceberg, where hidden partitioning originates, this means the partition strategy can change without breaking queries.
  • Engine pluggability: Spark, Trino, Flink, DuckDB, and Snowflake (for Iceberg) can all read the same table. The format outlives any one engine.
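
The first few properties in this list all fall out of one idea: the table is a log of immutable snapshots, each listing the data files it contains. The sketch below is a toy in-memory model of that idea, not the actual metadata layout of Iceberg, Delta, or Hudi. The names `DataFile`, `Snapshot`, `ToyTable`, and `plan_scan`, the integer timestamps, and the `min_day`/`max_day` statistics are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataFile:
    path: str
    min_day: int   # smallest event day stored in the file
    max_day: int   # largest event day stored in the file

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int            # logical timestamp, kept as an int for the sketch
    files: tuple[DataFile, ...]  # full, immutable file list for this table version

@dataclass
class ToyTable:
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, new_files: list[DataFile], committed_at: int) -> Snapshot:
        """Build the complete next snapshot first, then publish it in one step.

        Assumes callers pass increasing committed_at values.
        """
        current = self.snapshots[-1].files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, committed_at,
                            current + tuple(new_files))
        self.snapshots.append(snapshot)  # the single step that readers observe
        return snapshot

    def as_of(self, timestamp: int) -> Snapshot:
        """Time travel: newest snapshot committed at or before `timestamp`."""
        candidates = [s for s in self.snapshots if s.committed_at <= timestamp]
        if not candidates:
            raise LookupError(f"no snapshot at or before {timestamp}")
        return candidates[-1]

def plan_scan(snapshot: Snapshot, day: int) -> list[str]:
    """Pruning: keep only files whose [min_day, max_day] range can contain `day`."""
    return [f.path for f in snapshot.files if f.min_day <= day <= f.max_day]

table = ToyTable()
table.commit([DataFile("a.parquet", 1, 10)], committed_at=100)
table.commit([DataFile("b.parquet", 11, 20), DataFile("c.parquet", 21, 30)],
             committed_at=200)

latest = table.as_of(250)      # sees all three files
earlier = table.as_of(150)     # sees the table as it was after the first commit
files_to_read = plan_scan(latest, day=15)
```

A reader holding `earlier` keeps a consistent view even while the second commit is being prepared, because a snapshot is never mutated after it is published. Real formats persist this metadata as files next to the data and rely on a catalog or an atomic storage operation for the publish step, which this in-memory list only imitates.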

Analogy

Think of these as three kinds of storage buildings:

  • A data lake is a self-storage facility: rent a unit cheaply, throw any shape of box in it, sort it out when you need it. Schema-on-read. Cheap, flexible, easy to lose track of what's inside.
  • A data warehouse is an Amazon fulfilment centre: every item barcoded, slotted, indexed; a robot can pick any SKU in seconds. Schema-on-write. Fast to query, expensive to load, painful to change.
  • A lakehouse is self-storage with a barcode scanner and an inventory app on top. The cheap building stays; ACID + schema metadata sit on top so any consumer can ask 'what's in unit 17 right now, and what was in it last Tuesday?' without breaking the cheap-storage economics.

Open table formats (Iceberg, Delta, Hudi) are that barcode scanner.

Make it stick

Use the prompts below to anchor lake vs warehouse vs lakehouse to something you actually own.

  • Which of your current pipelines suffers from 'half-written reads' or non-reproducible historical queries? Those are lakehouse-shaped problems.
  • If your warehouse vendor doubled in price tomorrow, which workloads could move to a lakehouse architecture, and which truly need warehouse-grade latency?
  • What's the smallest dataset you could prototype on Iceberg / Delta next sprint to validate the operational model?
