Retries, Dead-Letter Queues & Poison Pills

What happens to the one record your handler can never process?


Retry tiers + DLQ

The poison pill problem

A poison pill is a record your consumer can never successfully process — a malformed payload, a referenced row that was deleted, a bug only that record triggers. With naive at-least-once retry, the consumer fails, does not commit the offset, re-polls the same record, and blocks the entire partition forever. One bad message silently stops a revenue stream.
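The stall can be reproduced with a small in-memory simulation. This is a sketch, not a real client: the partition is a plain list, `naive_consume` and `handler` are hypothetical names, and `max_polls` exists only so the demonstration terminates.

```python
def naive_consume(partition, handler, max_polls=10):
    """Naive at-least-once loop: the offset is committed only after success.

    Returns (committed_offset, polls_used). A real consumer has no
    max_polls cap and would keep re-polling the failing record.
    """
    committed = 0  # next offset to read
    polls = 0
    while committed < len(partition) and polls < max_polls:
        polls += 1
        try:
            handler(partition[committed])
        except Exception:
            continue  # no commit, so the same record is polled again
        committed += 1
    return committed, polls


def handler(record):
    if record == "bad":
        raise ValueError("cannot parse record")


partition = ["ok-1", "bad", "ok-2", "ok-3"]
committed, polls = naive_consume(partition, handler)
print(committed, polls)  # 1 10
```

The committed offset never moves past the bad record, so "ok-2" and "ok-3" are never seen, however many polls are allowed.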

The production pattern is a tiered escalation:

  1. In-memory retry with backoff for transient faults (network blip, lock timeout). Bounded — e.g. 3 attempts with exponential backoff + jitter.
  2. Retry topic(s) for slower transients. Republish the record to a delayed topic such as orders.retry.5s; a separate consumer reprocesses it after the delay, so the original partition keeps flowing.
  3. Dead-letter queue (DLQ) for true poison pills: after N attempts, publish to orders.DLQ with headers recording the error, stack trace, original topic/partition/offset, and attempt count. Alert on DLQ depth; never let it grow silently.
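The escalation can be sketched as two small helpers: a backoff delay for tier 1 and a routing decision for tiers 2 and 3. The names `backoff_delay`, `next_destination`, and `RETRY_DELAYS_S` are hypothetical, as is the second retry tier (orders.retry.60s); only orders.retry.5s and orders.DLQ appear in the text above. The jitter variant used here is "full jitter" (a uniform draw up to the exponential ceiling), which is one common choice among several.

```python
import random

# Hypothetical retry tiers: <topic>.retry.5s, then <topic>.retry.60s.
RETRY_DELAYS_S = [5, 60]


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Tier 1: full-jitter exponential backoff, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def next_destination(base_topic, failed_rounds):
    """Tiers 2 and 3: pick the topic a failed record is republished to.

    failed_rounds counts processing rounds that have given up so far
    (1 = the main consumer just exhausted its in-memory retries).
    """
    if failed_rounds < 1:
        raise ValueError("failed_rounds must be >= 1")
    if failed_rounds <= len(RETRY_DELAYS_S):
        return f"{base_topic}.retry.{RETRY_DELAYS_S[failed_rounds - 1]}s"
    return f"{base_topic}.DLQ"


print([next_destination("orders", n) for n in (1, 2, 3)])
# ['orders.retry.5s', 'orders.retry.60s', 'orders.DLQ']
```

In this sketch the round counter would travel with the record, for example in a header, so each retry consumer knows which tier comes next.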

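Ordering caveat: retry/DLQ topics break per-key ordering, because later records for the same key can be processed before the diverted record is retried. If strict ordering matters, you must halt the key rather than skip ahead. That is a real trade-off, not a free lunch: the key stalls until the failure is resolved.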
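One way to picture "halt the key" is a processor that parks a failed record and everything that later arrives for the same key, while other keys keep flowing. This is a minimal in-memory sketch under loud assumptions: `KeyHaltingProcessor` is a hypothetical class, the parked records live in process memory without bound, and a production system would need durable storage and a policy for when to resume.

```python
from collections import defaultdict, deque


class KeyHaltingProcessor:
    """Preserve per-key order by never skipping past a failed record."""

    def __init__(self, handler):
        self.handler = handler
        self.parked = defaultdict(deque)  # key -> records waiting, in arrival order

    def process(self, key, value):
        """Return True if handled now, False if parked behind a failure."""
        if key in self.parked:
            self.parked[key].append(value)  # queue behind the failed record
            return False
        try:
            self.handler(key, value)
            return True
        except Exception:
            self.parked[key].append(value)  # halt this key only
            return False

    def resume(self, key):
        """Replay parked records in order; stop again at the first failure."""
        queue = self.parked.get(key)
        while queue:
            try:
                self.handler(key, queue[0])
            except Exception:
                return False
            queue.popleft()
        self.parked.pop(key, None)
        return True
```

The cost is visible in the data structure: a halted key accumulates a backlog until someone fixes the cause and calls `resume`, whereas skipping to a DLQ keeps latency low at the price of ordering.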

DLQ-on-failure consumer (Python)

Wrap a handler so transient errors retry with backoff, and a record that exhausts its attempts is routed to a <topic>.DLQ with diagnostic headers — then the offset is committed so the partition advances.
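A minimal sketch of such a wrapper follows. It is deliberately client-agnostic: `with_dlq`, `TransientError`, the record dict shape, and the `publish`/`commit` callables are all hypothetical stand-ins for whatever your Kafka client provides, and the header names are illustrative rather than a standard.

```python
import random
import time
import traceback


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying in memory."""


def with_dlq(handler, publish, commit, max_attempts=3, base_delay=0.1,
             sleep=time.sleep):
    """Wrap handler: retry transients, dead-letter the rest, then commit."""

    def consume(record):
        for attempt in range(1, max_attempts + 1):
            try:
                handler(record["value"])
                break  # success
            except TransientError as exc:
                failure = exc
                if attempt < max_attempts:
                    # Exponential backoff with full jitter before retrying.
                    sleep(random.uniform(0, base_delay * 2 ** attempt))
                    continue
            except Exception as exc:
                failure = exc  # permanent: in-memory retry will not help
            # Reached only when attempts are exhausted or the error is permanent.
            publish(record["topic"] + ".DLQ", record["value"], {
                "error": repr(failure),
                "stack": "".join(traceback.format_exception(
                    type(failure), failure, failure.__traceback__)),
                "original_topic": record["topic"],
                "original_partition": str(record["partition"]),
                "original_offset": str(record["offset"]),
                "attempts": str(attempt),
            })
            break
        commit(record)  # advance the partition on success or after dead-lettering

    return consume


# Usage with in-memory stand-ins for the producer and the offset commit.
dlq, committed = [], []


def handler(value):
    if value == "bad":
        raise ValueError("malformed payload")


consume = with_dlq(
    handler,
    publish=lambda topic, value, headers: dlq.append((topic, value, headers)),
    commit=lambda record: committed.append(record["offset"]),
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
for offset, value in enumerate(["ok", "bad", "ok"]):
    consume({"topic": "orders", "partition": 0, "offset": offset, "value": value})

print(committed)   # [0, 1, 2]
print(dlq[0][0])   # orders.DLQ
```

Note the ordering inside `consume`: the commit happens only after `publish` returns, so if the DLQ write itself raises, the offset is not committed and the record is not lost.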

Reflect

Find a consumer with no DLQ today.

  • What happens *right now* if it hits a record it can never process?
  • Would a retry topic or an in-memory backoff cover most of your real failures?
  • For your strict-ordering streams, is 'skip to DLQ' or 'halt the key' the correct policy?
