Schema on Read vs Schema on Write

Pay the schema cost up-front, or pay it on every read.

0/2 done

Where the cost lands

When do you pay the schema tax?

  • Schema-on-write (RDBMS, warehouse): the schema must exist before you insert. Writes reject bad data; reads are fast and predictable.
  • Schema-on-read (data lake, JSON store): write anything, infer / project a schema only when querying. Writes never fail on shape; reads carry the parsing cost forever.

Heuristic: the more downstream consumers, the more schema-on-write pays off. The more exploratory the data, the more schema-on-read pays off.

TSA vs lost-and-found

Schema-on-write is TSA security at the airport: every passenger gets checked once, on entry. Slow gate, fast plane.

Schema-on-read is the lost-and-found bin: throw anything in, sort it out when someone comes looking. Fast bin, slow search.

Where does the tax land?

Locate the schema tax in your own stack.

  • Which dataset in your team is currently schema-on-read but *should* be schema-on-write because it has many trusting consumers?
  • Which dataset is currently schema-on-write but *should* be schema-on-read because the data is exploratory and the producer's schema keeps changing?
  • Where in your pipeline is the schema cost being paid twice (once on ingest, once on every query) without anyone noticing?

Reading in progress · 0 of 2 activities done