What does an open table format (Iceberg / Delta / Hudi) add to a data lake?

ACID transactions, schema evolution and time-travel on top of cheap object storage

Your platform team needs reproducible nightly model training across petabytes of event data sitting in S3. Which storage choice gives the best cost / control trade-off?

Keep the data on object storage but adopt an open table format (Iceberg / Delta / Hudi) for ACID writes and time-travel reads

Keep raw Parquet and accept non-reproducible training cuts

Migrate everything into a classical warehouse

Keep the data on object storage but adopt an open table format (Iceberg / Delta / Hudi) for ACID writes and time-travel reads

Lake vs Warehouse vs Lakehouse — Semantic Web Academy

Overview

Schema-on-read vs schema-on-write storage — and the lakehouse compromise.

Why it matters

The lake holds raw, semi-structured, and unstructured data cheaply; the warehouse holds curated, schema-strict tables; the lakehouse (Delta / Iceberg / Hudi) adds ACID transactions and schema enforcement on top of lake storage.
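To make the schema-on-read vs schema-on-write contrast concrete, here is a minimal Python sketch. It is a toy illustration only: the two-column SCHEMA and the helper names write_strict and read_lenient are hypothetical and do not come from any real engine.

```python
import json

# Hypothetical two-column schema, used only for this illustration.
SCHEMA = {"user_id": int, "event": str}

def write_strict(table, row):
    """Schema-on-write: validate before the row is stored (warehouse style)."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {sorted(row)} do not match {sorted(SCHEMA)}")
    for name, typ in SCHEMA.items():
        if not isinstance(row[name], typ):
            raise ValueError(f"{name} must be {typ.__name__}")
    table.append(row)

def read_lenient(raw_lines):
    """Schema-on-read: store anything, apply the schema when reading (lake style)."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            rows.append({name: typ(record[name]) for name, typ in SCHEMA.items()})
        except (KeyError, ValueError, TypeError):
            continue  # malformed records only surface at read time
    return rows

# The lake accepts both records as-is; the second one lacks user_id.
lake = ['{"user_id": "7", "event": "click"}', '{"event": "view"}']
warehouse = []
write_strict(warehouse, {"user_id": 7, "event": "click"})
lake_rows = read_lenient(lake)
```

The lake side accepts the malformed record and only drops it when a reader applies the schema, while the warehouse side would reject the same record at load time. That is the trade-off the lakehouse tries to soften.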

Going deeper

What an open table format actually buys you, line by line:

Atomic commits: a write either lands entirely or not at all, so readers never see a half-written table. (A directory of Parquet files alone can't promise this.)
Time travel: query the table as of yesterday at 14:00, as long as that snapshot is still within the retention window. Cheap rollback, reproducible ML training cuts, regulator-friendly audit.
Schema evolution: add, rename, or reorder columns without rewriting every file; old readers stay valid. (Support varies by format: Iceberg tracks columns by ID, and Delta needs column mapping enabled for renames.)
Metadata-based pruning and hidden partitioning: scanners skip files via metadata statistics, not directory layout. In Iceberg, hidden partitioning also means the partition strategy can change without breaking queries.
Engine pluggability: Spark, Trino, Flink, DuckDB, Snowflake (for Iceberg) can all read the same table. The format outlives any one engine.
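
Three of the items above (atomic commits, time travel, and metadata-based pruning) can be sketched with a toy in-memory model. This is not the API of Iceberg, Delta, or Hudi: ToyTable, DataFile, and make_file are hypothetical names, and real formats publish snapshots through a catalog or transaction log rather than a Python list. The sketch assumes immutable data files and an append-only list of snapshots.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DataFile:
    rows: tuple    # immutable payload, standing in for one Parquet file
    min_day: int   # per-file column statistics kept in table metadata
    max_day: int

@dataclass
class ToyTable:
    # Each snapshot is the complete tuple of files; snapshot 0 is the empty table.
    snapshots: list = field(default_factory=lambda: [()])

    def commit(self, new_files):
        """Atomic commit: build the full file list, then publish it in one step."""
        current = self.snapshots[-1]
        self.snapshots.append(current + tuple(new_files))
        return len(self.snapshots) - 1  # id of the new snapshot

    def scan(self, day, snapshot_id=None):
        """Read one day's rows, optionally as of an older snapshot (time travel)."""
        files = self.snapshots[-1 if snapshot_id is None else snapshot_id]
        # Pruning: skip files whose min/max statistics cannot contain `day`.
        candidates = [f for f in files if f.min_day <= day <= f.max_day]
        rows = [r for f in candidates for r in f.rows if r["day"] == day]
        return rows, len(candidates)

def make_file(rows):
    """Wrap rows in a DataFile and record min/max statistics for the day column."""
    days = [r["day"] for r in rows]
    return DataFile(tuple(rows), min(days), max(days))

table = ToyTable()
v1 = table.commit([make_file([{"day": 1, "x": "a"}, {"day": 2, "x": "b"}])])
v2 = table.commit([make_file([{"day": 5, "x": "c"}])])

latest_rows, files_read = table.scan(day=5)      # sees the second commit
old_rows, _ = table.scan(day=5, snapshot_id=v1)  # table as of the first commit
```

Because every snapshot lists its files in full, a reader pinned to v1 keeps getting the same answer however many commits follow, which is what makes a training cut reproducible. Schema evolution and engine pluggability are left out of this sketch.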

The names describe different layers. A lake is primarily a storage pattern: objects retained cheaply with schema applied by readers. A warehouse is a managed analytical database that owns storage, transactions, optimization, and serving. A lakehouse separates durable object storage from an open transactional table format and one or more compute engines. It does not automatically provide a catalog, workload isolation, row-level security, compaction, or low-latency serving; those remain platform responsibilities. Choose by workload and operating capacity, not by freshness of the label.

Analogy

Think of these three as three kinds of storage buildings:

A data lake is a self-storage facility: rent a unit cheaply, throw any shape of box in it, sort it out when you need it. Schema-on-read. Cheap, flexible, easy to lose track of what's inside.
A data warehouse is an Amazon fulfilment centre: every item barcoded, slotted, indexed; a robot can pick any SKU in seconds. Schema-on-write. Fast to query, expensive to load, painful to change.
A lakehouse is self-storage with a barcode scanner and an inventory app on top. The cheap building stays; ACID + schema metadata sit on top so any consumer can ask 'what's in unit 17 right now, and what was in it last Tuesday?' without breaking the cheap-storage economics.

Open table formats (Iceberg, Delta, Hudi) are that barcode scanner.

Make it stick

Use the prompts below to anchor lake vs warehouse vs lakehouse to something you actually own.

Which of your current pipelines suffers from 'half-written reads' or non-reproducible historical queries? Those are lakehouse-shaped problems.
If your warehouse vendor doubled in price tomorrow, which workloads could move to a lakehouse architecture, and which truly need warehouse-grade latency?
What's the smallest dataset you could prototype on Iceberg / Delta next sprint to validate the operational model?
