Theory
A log turns dumb files into a real table
Raw Parquet on S3 has no transactions: two writers clobber each other, a half-finished job leaves readers seeing garbage, and there's no way to read the table 'as of yesterday'. Delta Lake fixes this with one idea: an ordered transaction log (_delta_log/) of JSON commit files, each recording the Parquet files it adds or removes, so replaying the log tells you which files are part of the table right now.
- ACID via the log — A write is only 'real' once its commit is atomically appended to the log. Readers always see a consistent snapshot; a failed job's orphan files are simply never referenced. This is optimistic concurrency — conflicting commits are detected and retried, not silently lost.
- Time travel — Each log version is a point-in-time table state, so SELECT ... VERSION AS OF 12 or TIMESTAMP AS OF '...' just replays the log to that version. Same idea as Snowflake Time Travel, implemented on open files.
- MERGE / upserts & CDC — Delta supports MERGE INTO (insert/update/delete in one atomic statement), which makes SCD-2 dimensions and change-data-capture sinks practical on a lake. OPTIMIZE (compaction) and Z-ORDER (multi-column data skipping) keep read performance healthy as small files pile up.
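To make the log idea concrete, here is a toy sketch in plain Python of how replaying commit files yields a snapshot. It is not Delta's implementation: it assumes a local temp directory in place of S3, handles only add/remove actions, and ignores checkpoints, metadata, and protocol actions. The helper names commit and snapshot and the part-N.parquet paths are made up for illustration. Opening a file in exclusive-create mode stands in for the atomic "put if absent" that a real log store has to provide.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Write commit `version`, failing if another writer already claimed it."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    # Mode "x" raises FileExistsError when the file exists. This is a toy
    # stand-in for the atomic put-if-absent a real log store provides.
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir, as_of=None):
    """Replay commits 0..as_of (or all of them) and return the live file set."""
    live = set()
    # Zero-padded names sort lexicographically in version order.
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


log_dir = tempfile.mkdtemp()
commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
# Version 2 rewrites part-0 into part-2, as an update or compaction would.
commit(log_dir, 2, [
    {"remove": {"path": "part-0.parquet"}},
    {"add": {"path": "part-2.parquet"}},
])

latest = snapshot(log_dir)        # current table state
v1 = snapshot(log_dir, as_of=1)   # "VERSION AS OF 1"

# A second writer that also tries to claim version 2 loses the race.
try:
    commit(log_dir, 2, [{"add": {"path": "part-late.parquet"}}])
    conflict = False
except FileExistsError:
    conflict = True
```

The losing writer's part-late.parquet may already sit in storage, but no commit references it, so no snapshot ever includes it. In this toy the writer would re-read the log and retry as version 3.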
Use Case Example: A CDC stream from a Postgres orders table lands as inserts/updates/deletes. A single Delta MERGE INTO orders USING changes ... applies all three atomically, every micro-batch — readers never see a partial apply, and VERSION AS OF lets you audit exactly what the table looked like before last night's load.
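One way to picture that micro-batch is the sketch below. The column names (order_id, op, seq, status, amount) and the I/U/D op codes are assumptions, not taken from any real schema. MERGE_SQL shows one plausible shape for the statement, and it would need a Spark session with Delta configured to actually run. The pure-Python apply_changes is only a toy model of the same three branches over an in-memory dict. Like MERGE, it expects at most one change per key, so the batch is first reduced to the latest change per order_id.

```python
# Hypothetical statement shape; assumes `changes` is already deduplicated
# to one row per order_id and carries an `op` column with 'I', 'U' or 'D'.
MERGE_SQL = """
MERGE INTO orders AS t
USING changes AS c
ON t.order_id = c.order_id
WHEN MATCHED AND c.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.status = c.status, t.amount = c.amount
WHEN NOT MATCHED AND c.op <> 'D' THEN
  INSERT (order_id, status, amount) VALUES (c.order_id, c.status, c.amount)
"""


def latest_per_key(changes):
    """Keep only the most recent change (highest seq) for each order_id."""
    latest = {}
    for ch in sorted(changes, key=lambda c: c["seq"]):
        latest[ch["order_id"]] = ch
    return list(latest.values())


def apply_changes(table, changes):
    """Return a new table state; the input state is never mutated."""
    new = dict(table)
    for ch in latest_per_key(changes):
        key = ch["order_id"]
        if ch["op"] == "D":
            # Delete if present; a delete for an unknown key is a no-op.
            new.pop(key, None)
        else:
            # Insert or update, depending on whether the key exists.
            new[key] = {"status": ch["status"], "amount": ch["amount"]}
    return new


orders_v0 = {
    1: {"status": "new", "amount": 10.0},
    2: {"status": "new", "amount": 25.0},
}
changes = [
    {"seq": 1, "op": "U", "order_id": 1, "status": "shipped", "amount": 10.0},
    {"seq": 2, "op": "D", "order_id": 2, "status": None, "amount": None},
    {"seq": 3, "op": "I", "order_id": 3, "status": "new", "amount": 7.5},
    {"seq": 4, "op": "U", "order_id": 3, "status": "paid", "amount": 7.5},
]

orders_v1 = apply_changes(orders_v0, changes)
```

Because apply_changes builds a new state instead of editing the old one, orders_v0 is still available afterwards. That loosely mirrors how the previous table version stays queryable with VERSION AS OF after the load commits.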