Databricks — Delta Lake & the Transaction Log

How a JSON log over Parquet buys you ACID, time travel and streaming upserts.

Theory

A log turns dumb files into a real table

Raw Parquet on S3 has no transactions: two writers can clobber each other, a half-finished job leaves readers seeing partial data, and there is no way to query the table 'as of yesterday'. Delta Lake fixes this with one idea: an ordered transaction log (_delta_log/) of JSON commit files, each recording which Parquet files were added to or removed from the table. Replaying the log in order yields the set of files that make up the table right now.
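To make the idea concrete, here is a minimal toy model of log replay in Python. It is a sketch under simplifying assumptions, not Delta's implementation: the helper names write_commit and live_files are hypothetical, the commits carry only bare add/remove actions with a path, and real commits also hold metadata, protocol and statistics entries. The zero-padded version file names mirror how Delta names its commit files.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    """Write one commit as newline-delimited JSON, named by zero-padded version."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(log_dir):
    """Replay commits in version order and return the set of live data files."""
    files = set()
    # Zero-padded names sort lexicographically in version order.
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


# Hypothetical table directory with two commits.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

# Version 0 adds two files; version 1 rewrites one of them.
write_commit(log_dir, 0, [{"add": {"path": "part-000.parquet"}},
                          {"add": {"path": "part-001.parquet"}}])
write_commit(log_dir, 1, [{"remove": {"path": "part-000.parquet"}},
                          {"add": {"path": "part-002.parquet"}}])

current = live_files(log_dir)
print(sorted(current))
```

Note that part-000.parquet may still exist in storage after version 1; it is simply no longer referenced, which is why a reader following the log never sees it.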

  • ACID via the log: A write is only 'real' once its commit is atomically appended to the log. Readers always see a consistent snapshot; a failed job's orphan files are simply never referenced. Concurrent writers are handled with optimistic concurrency: each writer prepares its files, then tries to commit the next log version, and a conflicting commit is detected and retried or failed rather than silently lost.
  • Time travel: Each log version is a point-in-time table state, so SELECT ... VERSION AS OF 12 or TIMESTAMP AS OF '...' reconstructs the table from the log as of that version. Same idea as Snowflake Time Travel, implemented on open files.
  • MERGE / upserts & CDC: Delta supports MERGE INTO (insert/update/delete in one atomic statement), which makes SCD-2 dimensions and change-data-capture sinks practical on a lake. OPTIMIZE (compaction of small files) and its ZORDER BY clause (co-locating related values across several columns so more files can be skipped) keep read performance healthy as small files pile up.
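The commit and time-travel behaviour described in the bullets above can be sketched with an in-memory toy log. This is an illustration only: ToyLog and its methods are hypothetical names, and the toy rejects every stale commit, whereas a real engine checks whether the concurrent changes actually conflict before deciding.

```python
class ToyLog:
    """In-memory stand-in for an ordered transaction log (illustration only)."""

    def __init__(self):
        # List index is the version; each entry is a list of (op, path) actions.
        self.commits = []

    def latest_version(self):
        return len(self.commits) - 1  # -1 means the table is empty

    def try_commit(self, read_version, actions):
        """Append a commit only if nobody has committed since read_version."""
        if read_version != self.latest_version():
            return False  # conflict: the writer's view of the table is stale
        self.commits.append(list(actions))
        return True

    def snapshot(self, version=None):
        """Replay the log up to `version` (default: latest) to get live files."""
        if version is None:
            version = self.latest_version()
        files = set()
        for actions in self.commits[: version + 1]:
            for op, path in actions:
                if op == "add":
                    files.add(path)
                else:
                    files.discard(path)
        return files


log = ToyLog()
log.try_commit(-1, [("add", "a.parquet")])  # version 0

# Two writers read the table at the same version.
seen_by_a = log.latest_version()
seen_by_b = log.latest_version()

ok_a = log.try_commit(seen_by_a, [("add", "b.parquet")])  # becomes version 1
b_actions = [("remove", "a.parquet"), ("add", "c.parquet")]
ok_b = log.try_commit(seen_by_b, b_actions)  # rejected: version 1 already exists

# Writer B re-reads the latest version and retries.
ok_b_retry = log.try_commit(log.latest_version(), b_actions)  # becomes version 2

print(ok_a, ok_b, ok_b_retry)
print(sorted(log.snapshot(0)), sorted(log.snapshot(1)), sorted(log.snapshot()))
```

Calling snapshot(0) or snapshot(1) plays the role of VERSION AS OF: older states stay readable because commits are only ever appended, never edited.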

Use Case Example: A CDC stream from a Postgres orders table lands as inserts/updates/deletes. A single Delta MERGE INTO orders USING changes ... applies all three atomically, every micro-batch — readers never see a partial apply, and VERSION AS OF lets you audit exactly what the table looked like before last night's load.
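The merge semantics in this use case can be sketched in plain Python. This is a simulation of what one micro-batch does, not Delta's API: merge_changes, the change-record fields (op, order_id, seq, row) and the sample rows are all assumptions made for illustration. The sketch keeps only the latest change per key before applying the batch, since MERGE expects at most one source row to match each target row.

```python
def merge_changes(target, changes):
    """Return a new table state with one batch of CDC changes applied.

    target:  dict mapping order_id -> row dict
    changes: list of dicts with keys 'op', 'order_id', 'seq', 'row'
             (hypothetical record shape; 'seq' orders changes in the source)
    """
    # Keep only the most recent change for each key.
    latest = {}
    for change in changes:
        key = change["order_id"]
        if key not in latest or change["seq"] > latest[key]["seq"]:
            latest[key] = change

    # Build the new state on a copy so readers of `target` never see a partial apply.
    result = dict(target)
    for key, change in latest.items():
        if change["op"] == "delete":
            result.pop(key, None)       # matched + delete: remove the row
        else:
            result[key] = change["row"]  # matched: update; not matched: insert
    return result


orders = {1: {"status": "new"}, 2: {"status": "new"}}
changes = [
    {"op": "update", "order_id": 1, "seq": 10, "row": {"status": "paid"}},
    {"op": "update", "order_id": 1, "seq": 11, "row": {"status": "shipped"}},
    {"op": "delete", "order_id": 2, "seq": 12, "row": None},
    {"op": "insert", "order_id": 3, "seq": 13, "row": {"status": "new"}},
]

new_orders = merge_changes(orders, changes)
print(new_orders)
```

The old orders dict is left untouched, which mirrors how the previous table version stays queryable after the merge commits.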

Analogy

The Delta log is a bank ledger sitting on top of a pile of cash drawers (Parquet files). The cash is just paper until the ledger says which notes count toward today's balance. Two tellers can't both spend the same note, because a transaction is only real once it's written, in order, to the ledger. You can also read the ledger back to last Tuesday to see exactly what the balance was (time travel). The drawers can stay messy; the ledger is the source of truth.

Reflect

Delta proves you don't need a proprietary database to get database guarantees — you need an ordered log and discipline about appending to it. That's the same insight behind Kafka, Git and write-ahead logs everywhere.

  • Where are you running upserts today with brittle delete-then-insert logic that MERGE would make atomic?
  • Do your streaming Delta tables get OPTIMIZE'd, or are small files quietly slowing every read?
