Retry tiers + DLQ
The poison pill problem
A poison pill is a record your consumer can never successfully process — a malformed payload, a referenced row that was deleted, a bug only that record triggers. With naive at-least-once retry, the consumer fails, does not commit the offset, re-polls the same record, and blocks the entire partition forever. One bad message silently stops a revenue stream.
The production pattern is a tiered escalation:
- In-memory retry with backoff for transient faults (network blip, lock timeout). Bounded — e.g. 3 attempts with exponential backoff + jitter.
- Retry topic(s) for slower transients. Republish to orders.retry.5s; a consumer with a delay reprocesses the record later, while the original partition keeps flowing.
- Dead-letter queue (DLQ) for true poison pills: after N attempts, publish to orders.DLQ with headers recording the error, stack, original topic/partition/offset and attempt count. Alert on DLQ depth; never let it grow silently.
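The three tiers above can be sketched as a single handler. This is a minimal illustration under stated assumptions, not a definitive implementation: Record, TransientError, handle, the publish callback, the header names, and MAX_DELIVERIES (standing in for N) are all hypothetical, and no real client library is involved. It assumes the caller commits the offset after handle returns, whatever the outcome, so the original partition keeps flowing.

```python
import random
import time
import traceback
from dataclasses import dataclass, field

MAX_IN_MEMORY_ATTEMPTS = 3   # tier 1 bound
MAX_DELIVERIES = 5           # hypothetical N: failed deliveries before dead-lettering
RETRY_TOPIC = "orders.retry.5s"
DLQ_TOPIC = "orders.DLQ"


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying in memory."""


@dataclass
class Record:
    topic: str
    partition: int
    offset: int
    value: bytes
    headers: dict = field(default_factory=dict)


def backoff_delay(attempt, base=0.1, cap=2.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def handle(record, process, publish, sleep=time.sleep):
    """Run one record through the tiers. Returns 'ok', 'retry', or 'dlq'."""
    last_error = None
    # Tier 1: bounded in-memory retry, only for transient faults.
    for attempt in range(MAX_IN_MEMORY_ATTEMPTS):
        try:
            process(record)
            return "ok"
        except TransientError as exc:
            last_error = exc
            if attempt < MAX_IN_MEMORY_ATTEMPTS - 1:
                sleep(backoff_delay(attempt))
        except Exception as exc:
            last_error = exc
            break  # not transient: retrying in memory will not help

    # Tiers 2 and 3: republish with diagnostic headers instead of blocking.
    deliveries = int(record.headers.get("attempts", 0)) + 1
    h = record.headers
    headers = {
        "attempts": str(deliveries),
        # Preserve the first-seen coordinates across retry hops.
        "original.topic": h.get("original.topic", record.topic),
        "original.partition": h.get("original.partition", str(record.partition)),
        "original.offset": h.get("original.offset", str(record.offset)),
        "error": repr(last_error),
        "stack": "".join(traceback.format_exception(
            type(last_error), last_error, last_error.__traceback__)),
    }
    target = DLQ_TOPIC if deliveries >= MAX_DELIVERIES else RETRY_TOPIC
    publish(target, record.value, headers)
    return "dlq" if target == DLQ_TOPIC else "retry"
```

The publish callback is where a real producer would go. Keeping it injectable, along with sleep, lets the escalation logic be tested without a broker.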
Ordering caveat: retry/DLQ topics break per-key ordering for the affected records. If strict ordering matters, you must halt the key rather than skip ahead — a real trade-off, not a free lunch.
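One way to picture "halting the key" is a gate that parks every later record for a key while an earlier record for that key is still unresolved. The sketch below is illustrative only: KeyGate and its methods are hypothetical names, and the state lives in memory, so it would be lost on restart or rebalance. A real system would need to persist it.

```python
from collections import defaultdict, deque


class KeyGate:
    """Parks records for any key that has a failed record outstanding."""

    def __init__(self):
        self._parked = defaultdict(deque)  # key -> records waiting, in arrival order

    def admit(self, key, record):
        """Return True if the record may be processed now; otherwise park it."""
        if key in self._parked:
            self._parked[key].append(record)
            return False
        return True

    def halt(self, key, record):
        """Call when a record fails: it becomes the head of the key's queue."""
        self._parked[key].append(record)

    def release(self, key):
        """Call once the failure is resolved: return parked records in original order."""
        return list(self._parked.pop(key, ()))


gate = KeyGate()
gate.halt("customer-1", "order-A")           # order-A failed
gate.admit("customer-1", "order-B")          # parked behind order-A
gate.admit("customer-2", "order-C")          # other keys are unaffected
replay = gate.release("customer-1")          # order-A, then order-B
```

The cost is visible in the sketch: every record for the halted key waits, and the parked queue grows until someone resolves the failure.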