Why Kafka Is Fast: Page Cache & Zero-Copy

No magic — just sequential IO, the OS page cache, and sendfile().

Sequential IO + page cache + sendfile

The mechanical sympathy behind the throughput

Kafka routinely pushes gigabytes/second per broker on commodity disks. The secret is that it cooperates with the operating system instead of fighting it:

  • Sequential, append-only writes. Spinning disks are orders of magnitude faster at sequential IO than at random IO, because sequential access avoids seeks. SSDs narrow the gap but still favor large sequential writes. An append-only log is about the most disk-friendly access pattern there is.
  • The OS page cache is the read cache. Kafka does not maintain its own in-process cache. Recently written segments are still in the kernel page cache, so consumers reading the tail almost never touch disk — they read from RAM the producer just warmed. This is why lagging consumers hurt: they force cold-segment reads that evict the hot pages everyone else relies on.
  • Zero-copy via sendfile(2). To ship bytes to a consumer, Kafka asks the kernel (through Java's FileChannel.transferTo) to move data straight from the page cache to the socket, so the payload never enters user space. Nothing is copied into the JVM heap and nothing is re-serialized, because the on-disk record format is also the wire format. (Enabling TLS disables zero-copy: the broker must read the bytes into user space to encrypt them, which is a real, measurable cost of encryption.)
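
The first and third points can be sketched in a few lines of Python. This is a minimal illustration, not Kafka's code: append_record, serve_range, and demo are hypothetical names, the "log" is a temporary file, and a local socket pair stands in for the broker-to-consumer connection. Python's socket.sendfile() uses os.sendfile() where the platform supports it and otherwise falls back to ordinary send() calls, so the zero-copy path is only taken on systems that offer it.

```python
import os
import socket
import tempfile


def append_record(log, payload: bytes) -> int:
    """Append one record at the end of the log and return its byte offset."""
    log.seek(0, os.SEEK_END)
    offset = log.tell()
    log.write(payload)
    log.flush()  # hand the bytes to the kernel (page cache); no fsync here
    return offset


def serve_range(log, sock: socket.socket, offset: int, count: int) -> int:
    """Send count bytes of the log, starting at offset, to the socket.

    socket.sendfile() delegates to os.sendfile() when available, so the
    payload does not have to pass through a Python bytes object.
    """
    return sock.sendfile(log, offset=offset, count=count)


def demo() -> bytes:
    """Write three records, then ship everything from the second one onward."""
    with tempfile.TemporaryFile() as log:  # opened in binary mode
        append_record(log, b"record-0\n")
        start = append_record(log, b"record-1\n")
        append_record(log, b"record-2\n")
        size = log.tell()

        # Hypothetical stand-in for a broker/consumer TCP connection.
        broker_side, consumer_side = socket.socketpair()
        with broker_side, consumer_side:
            sent = serve_range(log, broker_side, start, size - start)
            broker_side.shutdown(socket.SHUT_WR)  # signal end of stream
            received = b""
            while chunk := consumer_side.recv(4096):
                received += chunk
        assert sent == len(received)
        return received
```

Calling demo() returns the bytes of the second and third records. The "consumer" names a starting offset and the "broker" never inspects or re-encodes the payload, which is the property that makes the kernel-side copy possible.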

Design consequence: keep consumers caught up. A fleet reading the tail is cheap; a fleet replaying history competes for page cache and IO with everyone else.
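
A toy simulation makes this consequence concrete. The sketch below models the page cache as a plain LRU of fixed capacity, which is a simplifying assumption (the real kernel uses a more elaborate replacement policy), and PageCache, tail_hit_ratio, and all parameters are hypothetical names chosen for illustration. Each tick the producer writes one new page, a replaying consumer optionally reads some cold pages, and a tail consumer reads the page written tail_lag ticks earlier.

```python
from collections import OrderedDict


class PageCache:
    """Toy LRU page cache: a miss loads the page and may evict the oldest."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pages: OrderedDict = OrderedDict()

    def touch(self, page) -> bool:
        """Access a page and return True if it was already cached."""
        hit = page in self.pages
        if hit:
            self.pages.move_to_end(page)  # mark as most recently used
        else:
            self.pages[page] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
        return hit


def tail_hit_ratio(ticks: int, capacity: int, tail_lag: int,
                   replay_pages_per_tick: int) -> float:
    """Fraction of tail-consumer reads that are served from the cache."""
    cache = PageCache(capacity)
    hits = reads = 0
    cold = 0  # index of the next historical page the replayer will read
    for t in range(ticks):
        cache.touch(("new", t))  # producer write warms the cache
        for _ in range(replay_pages_per_tick):
            cache.touch(("old", cold))  # replay drags in a cold page
            cold += 1
        if t >= tail_lag:
            reads += 1
            hits += int(cache.touch(("new", t - tail_lag)))
    return hits / reads


quiet = tail_hit_ratio(ticks=100, capacity=8, tail_lag=2, replay_pages_per_tick=0)
noisy = tail_hit_ratio(ticks=100, capacity=8, tail_lag=2, replay_pages_per_tick=4)
```

With these illustrative numbers, quiet is 1.0 and noisy is 0.0: the same tail consumer goes from always hitting the cache to always missing it once a replayer pushes more cold pages through the cache than it can hold. Real workloads fall somewhere in between, but the direction of the effect is the same.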

Sell off the cooling rack

A bakery that sells bread straight off the cooling rack (page cache) serves a queue instantly. The moment someone orders a loaf from last week (a lagging consumer), staff must go to the cold cellar — slowing the whole counter. Kafka is fast precisely because most customers want the loaf that's still warm.

Reflect

Look at your worst-performing broker.

  • How much of its IO is tail reads vs history replays right now?
  • Which batch job replays from offset 0 during peak — and could it run off-peak?
  • Are you paying the TLS zero-copy tax on internal traffic that could use a private network instead?
