Partitioning, Compaction and the Small-File Problem

The two operational habits that decide whether your lake stays fast.


Theory

The lake's two enemies: too many tiny files, wrong partitions

Partitioning physically separates rows into directories by the value of a column (event_date=2026-05-27/...). Done right, a query that filters on that column reads only the matching partitions and skips almost all of the data. Done wrong (on a high-cardinality column like user_id), you create millions of microscopic files and the engine spends its life opening them.
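To make the difference concrete, here is a minimal sketch in plain Python. It imitates the directory-per-value layout with a dictionary; the helper names (`write_partitioned`, `prune`) and the sample rows are made up for illustration and are not any engine's API.

```python
from collections import defaultdict

def write_partitioned(rows, column):
    """Group rows into 'column=value' directories (one dict key per directory)."""
    layout = defaultdict(list)
    for row in rows:
        layout[f"{column}={row[column]}"].append(row)
    return dict(layout)

def prune(layout, column, value):
    """Keep only the directory a filter on `column` can match."""
    wanted = f"{column}={value}"
    return {d: r for d, r in layout.items() if d == wanted}

# Hypothetical sample data.
rows = [
    {"event_date": "2026-05-26", "user_id": 1},
    {"event_date": "2026-05-27", "user_id": 2},
    {"event_date": "2026-05-27", "user_id": 3},
    {"event_date": "2026-05-28", "user_id": 4},
]

by_date = write_partitioned(rows, "event_date")  # one directory per day
by_user = write_partitioned(rows, "user_id")     # one directory per user
scanned = prune(by_date, "event_date", "2026-05-27")

print(len(by_date), len(by_user), list(scanned))
```

Partitioning by date gives one directory per day, and the filter touches exactly one of them. Partitioning by user_id gives one directory per user, so the directory count grows with your user base rather than with the calendar.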

Compaction (OPTIMIZE in Delta, rewrite_data_files in Iceberg) merges those microscopic files into ~128MB–1GB chunks — the sweet spot for object storage + columnar reads.
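The core idea behind compaction can be sketched as a grouping problem: pack small files into batches that each come close to a target size, then rewrite every batch as one file. The sketch below is a simplified greedy version, not the actual algorithm used by Delta or Iceberg; `plan_compaction` and the 512 MB target are assumptions chosen from inside the range above.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group files so each group totals at most target_mb.

    Each returned group would be rewritten as a single output file.
    A file larger than target_mb ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current group if this file would push it over the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Hypothetical input: 1,000 files of 4 MB each.
small_files = [4] * 1000
plan = plan_compaction(small_files, target_mb=512)

print(len(small_files), "files ->", len(plan), "files")
```

With these numbers, a thousand file opens become eight, which is the whole point of the exercise.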

Rules of thumb: partition on a low-cardinality column the queries actually filter on (usually a date); run compaction on a schedule; never partition on something with millions of distinct values.
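A quick back-of-the-envelope check for a candidate partition column is to divide the table size by the column's distinct-value count. The sketch below assumes rows are spread evenly across values (real data is usually skewed) and borrows the 128 MB lower bound from the file-size range above as a hypothetical threshold; the function names are illustrative.

```python
def avg_partition_size_mb(table_size_gb, distinct_values):
    """Average data per partition, assuming an even spread across values."""
    return table_size_gb * 1024 / distinct_values

def partition_verdict(table_size_gb, distinct_values, min_partition_mb=128):
    """Flag columns whose partitions would be smaller than one healthy file."""
    size_mb = avg_partition_size_mb(table_size_gb, distinct_values)
    return "ok" if size_mb >= min_partition_mb else "too fine-grained"

# Hypothetical 1 TB table.
print(partition_verdict(1024, distinct_values=365))        # a year of event_date
print(partition_verdict(1024, distinct_values=5_000_000))  # user_id
```

A year of dates leaves each partition with a few gigabytes to work with. Five million user IDs leave each partition with a fraction of a megabyte, and no compaction job can fix that, because files are never merged across partition boundaries.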

Analogy

Partitioning is how you file paper in a filing cabinet. File by month (low cardinality) and finding 'everything from May' means pulling one drawer. File by customer's full name (high cardinality) and you get a million one-sheet folders — finding anything means opening every drawer in the building. Compaction is the office junior who comes in at night and merges those thousands of one-page folders into tidy hundred-page binders, so tomorrow's search touches a handful of binders instead of a blizzard of loose sheets. Skip the junior's night shift and the cabinet silently grinds to a halt.

Reflect

Most 'our lakehouse got slow' incidents trace back to one of three causes: the wrong partition column, no compaction schedule, or a streaming job writing micro-batches with no trigger-interval discipline.
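The streaming case is easy to underestimate, so here is the arithmetic as a small sketch. It assumes every micro-batch writes a fixed number of files (at least one per partition it touches); `files_per_day` and the figure of 4 files per batch are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def files_per_day(trigger_interval_s, files_per_batch):
    """Files a streaming job adds per day at a given trigger interval."""
    batches = SECONDS_PER_DAY // trigger_interval_s
    return batches * files_per_batch

every_10s = files_per_day(10, files_per_batch=4)
every_5min = files_per_day(300, files_per_batch=4)

print(every_10s, every_5min)
```

Under these assumptions, moving the trigger from 10 seconds to 5 minutes cuts the daily file count by a factor of 30, at the cost of fresher data arriving a few minutes later.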

  • Run a `file_count` audit on your biggest tables — which one is over 100k files?
  • Which producer team is creating the worst small-file pollution, and what would change their incentive?
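A file-count audit can start as simply as walking each table's directory. The sketch below assumes the tables live on a local or mounted filesystem and store data as .parquet files; on object storage you would swap `os.walk` for your store's listing API. The names `file_count` and `audit` are illustrative, and the 100,000 default mirrors the threshold in the first question above.

```python
import os

def file_count(table_root, suffix=".parquet"):
    """Count data files under a table directory, across all partitions."""
    count = 0
    for _dirpath, dirnames, filenames in os.walk(table_root):
        # Skip metadata directories such as _delta_log (assumed convention:
        # a leading underscore marks non-data directories).
        dirnames[:] = [d for d in dirnames if not d.startswith("_")]
        count += sum(1 for name in filenames if name.endswith(suffix))
    return count

def audit(table_roots, threshold=100_000):
    """Return (table, file_count) pairs above the threshold, worst first."""
    counts = {root: file_count(root) for root in table_roots}
    flagged = [(root, n) for root, n in counts.items() if n > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Note that this counts files on disk, which can include files the table's current version no longer references, so treat the result as an upper bound and a prompt for a closer look.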
