Row vs Columnar — Why Parquet Wins Analytics

The 1-page intuition: physical layout dictates which queries are cheap.

Theory

Layout is destiny

A row-oriented file stores (id, name, email, country, amount) for row 1, then row 2, then row 3, back to back on disk. That is great for 'give me this whole user' and terrible for 'sum amount across 100M rows': the scan has to read every column of every row, including the four it doesn't need.
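A minimal sketch of that trade-off, using Python's struct module to model a row-oriented file. The fixed-width schema (66 bytes per row), the sample values, and the helper names write_row_file, get_user, and sum_amount are all assumptions made for illustration, not a real file format.

```python
import struct

# Hypothetical fixed-width schema: id, name, email, country, amount.
ROW = struct.Struct("<q16s32s2sd")  # 8 + 16 + 32 + 2 + 8 = 66 bytes per row


def write_row_file(n):
    """Serialize n rows back to back: row 0, then row 1, and so on."""
    return b"".join(
        ROW.pack(
            i,
            f"user{i}".encode(),
            f"user{i}@example.com".encode(),
            b"DE",
            float(i % 100),
        )
        for i in range(n)
    )


def get_user(buf, i):
    """Point lookup: one contiguous 66-byte slice holds the whole row."""
    return ROW.unpack_from(buf, i * ROW.size)


def sum_amount(buf):
    """Aggregate: every record is decoded just to reach its last 8 bytes."""
    total = sum(row[4] for row in ROW.iter_unpack(buf))
    return total, len(buf)  # (answer, bytes touched)


buf = write_row_file(1000)
user = get_user(buf, 42)
total, bytes_read = sum_amount(buf)
print(user[0], total, bytes_read)
```

In this model the lookup touches 66 bytes, while the sum touches all 66,000 bytes to use 8,000 of them. A real engine could seek past the unwanted fields, but storage is read in blocks, so the unwanted columns usually come along anyway.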

A columnar file stores all id values together, then all name values, and so on through all amount values. The analytics query reads only the amount column (often 1–5% of the bytes in a wide table), and compression is dramatically better because adjacent values share a type and are often similar.
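Both effects can be sketched in a few lines of Python. The helper names (write_column, sum_column, run_length_encode) and the sample data are hypothetical. Run-length encoding stands in here for the family of encodings that column stores apply, and the country values are assumed to be clustered so that long runs exist.

```python
import struct
from itertools import groupby


def write_column(fmt, values):
    """Store one column contiguously as fixed-width values."""
    return struct.pack(f"<{len(values)}{fmt}", *values)


def sum_column(buf):
    """Aggregate by decoding only this column's bytes."""
    total = sum(v for (v,) in struct.iter_unpack("<d", buf))
    return total, len(buf)  # (answer, bytes touched)


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


n = 1000
amount_col = write_column("d", [float(i % 100) for i in range(n)])
total, bytes_read = sum_column(amount_col)

# Assumed to be clustered by country; unsorted data would give shorter runs.
countries = ["DE"] * 600 + ["FR"] * 300 + ["US"] * 100
encoded = run_length_encode(countries)
print(total, bytes_read, encoded)
```

With 8-byte doubles, the amount column for 1,000 rows is 8,000 bytes no matter how wide the other columns are, and 1,000 country values collapse to three pairs. The exact savings on real data depend on column widths, value distribution, and sort order.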

Parquet and ORC are the industry-standard columnar file formats. Combined with predicate pushdown (skipping whole row groups whose min/max statistics rule out a match), they are a large part of why modern warehouses can answer queries over petabyte-scale tables for cents.
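The row-group skipping can be modelled in plain Python. This is an illustrative sketch of the idea, not Parquet's actual reader: RowGroup, write_row_groups, and sum_where_greater are hypothetical names, and the data is assumed to be sorted by amount so that the statistics are selective.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics stored beside it."""
    amounts: list
    lo: float = field(init=False)
    hi: float = field(init=False)

    def __post_init__(self):
        self.lo = min(self.amounts)
        self.hi = max(self.amounts)


def write_row_groups(amounts, group_size):
    """Split a column into fixed-size row groups, computing stats at write time."""
    return [
        RowGroup(amounts[i:i + group_size])
        for i in range(0, len(amounts), group_size)
    ]


def sum_where_greater(groups, threshold):
    """Sum amounts > threshold, skipping groups whose max rules out a match."""
    total = 0.0
    scanned = 0
    for group in groups:
        if group.hi <= threshold:
            continue  # no value in this group can match, so it is never decoded
        scanned += 1
        total += sum(v for v in group.amounts if v > threshold)
    return total, scanned


groups = write_row_groups([float(i) for i in range(1000)], group_size=100)
total, scanned = sum_where_greater(groups, threshold=949.0)
print(total, scanned, len(groups))
```

Here only 1 of the 10 row groups is decoded. The benefit depends on how the data is laid out: if the amounts were shuffled, every group's min/max range would likely span the threshold and nothing could be skipped.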

Analogy

Row-oriented storage is a library that shelves books by reader: 'all the books this person checked out' is fast; 'the most-borrowed page-127 across the city' requires opening every book. Columnar storage is a library that shelves all page-127s together: weird for one reader, devastatingly fast for the city-wide question. Analytics is a city-wide question.

Same query, two layouts

The bytes on disk

Two layouts of the same table. The analytics query highlights how much each engine has to read.

Reflect

The leverage of columnar storage is so large that several mainstream OLTP engines now offer columnar or hybrid layouts (Postgres via column-store extensions, MySQL HeatWave, SQL Server columnstore indexes). The split is blurring, but the intuition of layout-driven cost is timeless.

  • Where in your pipeline do you still ship row-oriented analytical data (CSV exports, JSONL)?
  • What's the smallest pilot you could run to prove the Parquet savings to a sceptical stakeholder?