JMX Metrics & Consumer-Lag Monitoring

If you only monitor one thing, monitor lag.

JMX + Prometheus + alerts

The metrics that matter

Kafka exposes broker, producer, consumer and Streams internals via JMX. In a modern deployment you scrape them with the JMX Exporter for Prometheus and alert on:

  • Consumer lag per group, per partition: kafka_consumergroup_lag (the name used by kafka_exporter; brokers do not report group lag over JMX).
  • Under-replicated partitions: kafka_server_replicamanager_underreplicatedpartitions.
  • Request latency p99 per API.
  • Disk usage per log dir.
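
The name kafka_server_replicamanager_underreplicatedpartitions is not chosen by Kafka; it comes from how the JMX Exporter rewrites the MBean kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. A minimal sketch of an exporter config that would produce that name is shown here, assuming the exporter runs as a Java agent on each broker. The single catch-all rule is an illustration, not a complete production config.

```yaml
# JMX Exporter config (sketch). Assumes the agent is attached to each broker.
lowercaseOutputName: true   # ReplicaManager -> replicamanager, etc.
rules:
  # Matches gauges such as
  #   kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  # and emits kafka_server_replicamanager_underreplicatedpartitions.
  - pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value
    name: kafka_$1_$2_$3
    type: GAUGE
```

Any non-zero value of this gauge means at least one partition has fewer in-sync replicas than its replication factor, so it is usually alerted on with a threshold of 0.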

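Two tools build on these signals. Burrow evaluates consumer lag over a sliding window and reports a health status per group instead of a raw number. Cruise Control uses broker load metrics to automate partition rebalancing.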

Prometheus alerting rule for consumer lag

Write a Prometheus alerting rule that fires when any consumer group stays more than 100,000 messages behind on any partition for longer than 5 minutes.
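
One possible shape for such a rule is sketched here. It assumes lag is exported as kafka_consumergroup_lag with one series per group, topic and partition, and that the series carry consumergroup, topic and partition labels (the labels kafka_exporter uses; adjust them if your exporter differs). The group name, alert name and severity are hypothetical placeholders.

```yaml
groups:
  - name: kafka-consumer-lag          # hypothetical group name
    rules:
      - alert: KafkaConsumerLagHigh   # hypothetical alert name
        # One series per (group, topic, partition), so the comparison
        # is evaluated for every partition independently.
        expr: kafka_consumergroup_lag > 100000
        # The condition must hold continuously for 5 minutes before firing.
        for: 5m
        labels:
          severity: warning           # assumed severity level
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is lagging"
          description: "Lag on {{ $labels.topic }}/{{ $labels.partition }} has exceeded 100000 messages for over 5 minutes (current: {{ $value }})."
```

The `for: 5m` clause is what separates a sustained backlog from a short spike: if lag drops below the threshold at any evaluation, the pending timer resets.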
