Sequential IO + page cache + sendfile
The mechanical sympathy behind the throughput
Kafka routinely pushes gigabytes/second per broker on commodity disks. The secret is that it cooperates with the operating system instead of fighting it:
- Sequential, append-only writes. Spinning disks and SSDs are both an order of magnitude faster at sequential IO than random IO. An append-only log is the most disk-friendly access pattern that exists.
- The OS page cache is the read cache. Kafka does not maintain its own in-process cache. Recently written segments are still in the kernel page cache, so consumers reading the tail almost never touch disk — they read from RAM the producer just warmed. This is why lagging consumers hurt: they force cold-segment reads that evict the hot pages everyone else relies on.
- Zero-copy via
sendfile(2). To ship bytes to a consumer, Kafka asks the kernel to copy straight from the page cache to the socket — bypassing user space entirely. No serialization, no JVM heap churn. (Enabling TLS disables zero-copy, because the bytes must be encrypted in user space — a real, measurable cost of encryption.)
Design consequence: keep consumers caught up. A fleet reading the tail is cheap; a fleet replaying history competes for page cache and IO with everyone else.