Operating GraphRAG means owning graph-specific incidents that naive RAG never has: stale subgraphs, alias-merge regressions, and path explosion under load. Day-2 ops turns these risks into monitored SLOs and runbooks.
index_health: { graph_last_refresh_minutes: 12, vector_last_refresh_minutes: 9 }
runbooks:
  stale_graph: force incremental replay from the event log
  alias_regression: roll back the resolver model and rebuild the affected subgraph
slos: { p95_latency_ms: 2200, faithfulness_min: 0.92 }
The signal unique to GraphRAG is refresh skew. Track how far the graph index and the vector index each lag the source event stream, and also the gap between the two. A graph refreshed 12 min ago paired with a vector index refreshed 9 min ago is 3 min out of step, so a single answer can draw on inconsistent evidence; skew over a threshold is your earliest staleness alarm, well before users see wrong answers.
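A minimal sketch of the skew check, assuming the two lag values from index_health are already collected in minutes. The class and function names, and the thresholds max_lag_minutes=15 and max_skew_minutes=5, are illustrative assumptions and not values from the config above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexHealth:
    # Minutes since each index last caught up with the source event stream.
    graph_last_refresh_minutes: float
    vector_last_refresh_minutes: float


def refresh_skew_minutes(health: IndexHealth) -> float:
    """Gap between the two indexes, regardless of which one is older."""
    return abs(health.graph_last_refresh_minutes - health.vector_last_refresh_minutes)


def staleness_alerts(
    health: IndexHealth,
    max_lag_minutes: float = 15.0,   # assumed threshold, tune per deployment
    max_skew_minutes: float = 5.0,   # assumed threshold, tune per deployment
) -> list[str]:
    """Return the names of the staleness signals that are over threshold."""
    alerts = []
    if health.graph_last_refresh_minutes > max_lag_minutes:
        alerts.append("graph_lag")
    if health.vector_last_refresh_minutes > max_lag_minutes:
        alerts.append("vector_lag")
    if refresh_skew_minutes(health) > max_skew_minutes:
        alerts.append("refresh_skew")
    return alerts
```

With the sample values (12 and 9 minutes) the skew is 3 minutes, which stays under the assumed 5-minute threshold, so nothing fires. Keeping lag and skew as separate signals matters: both indexes can be equally stale with zero skew, and both can be fresh enough individually while still disagreeing.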
Every graph-specific risk needs a runbook with an owner, for example 'stale graph → replay from event log' or 'alias regression → roll back resolver and rebuild that subgraph'. Without pre-written runbooks, a 2 a.m. alias-merge incident becomes an improvised, high-blast-radius rebuild of the whole graph.
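One way to enforce "a runbook with an owner" is a coverage check that fails when a known risk has no entry. The sketch below is an assumption-laden illustration: the Runbook type, the owner names, and the risk identifier path_explosion are hypothetical, while the two actions are taken from the runbooks config above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    action: str
    owner: str  # hypothetical on-call rotation name


# Actions mirror the runbooks config; owners are made up for illustration.
RUNBOOKS = {
    "stale_graph": Runbook(
        action="force incremental replay from the event log",
        owner="graph-ingest-oncall",
    ),
    "alias_regression": Runbook(
        action="roll back the resolver model and rebuild the affected subgraph",
        owner="entity-resolution-oncall",
    ),
}

# The three graph-specific risks named at the start of this section.
KNOWN_RISKS = ["stale_graph", "alias_regression", "path_explosion"]


def missing_runbooks(known_risks, runbooks):
    """Risks that have no runbook, or a runbook with no owner."""
    return sorted(
        risk
        for risk in known_risks
        if risk not in runbooks or not runbooks[risk].owner
    )


if __name__ == "__main__":
    gaps = missing_runbooks(KNOWN_RISKS, RUNBOOKS)
    if gaps:
        print("risks without an owned runbook:", ", ".join(gaps))
```

Run against the sample config, this check reports path_explosion, since the config lists runbooks for only two of the three risks. Wiring a check like this into CI is one option for catching the gap before the 2 a.m. page does.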
Carry the quality gates into runtime. The release-time thresholds (path-support, unsupported-claim rate) become live SLOs; when they breach in production, that's an incident, not a dashboard curiosity. Ops is where Levels 2–4 stop being build-time concerns and become pager duty.
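As a rough sketch of what "live SLO" means in practice, the check below compares a window of production metrics against the two thresholds from the slos config. The metric key "faithfulness" and the function name are assumptions; path-support and unsupported-claim rate would be added the same way once their release-time thresholds are copied over, and they are left out here because this section does not state them.

```python
# Thresholds copied from the slos config in this section.
SLOS = {"p95_latency_ms": 2200, "faithfulness_min": 0.92}


def slo_breaches(metrics: dict, slos: dict = SLOS) -> list[str]:
    """Return the names of SLOs breached by one window of live metrics.

    `metrics` is assumed to carry 'p95_latency_ms' and 'faithfulness'
    measured over the same production window.
    """
    breaches = []
    # Latency is an upper bound: breach when the observed p95 exceeds it.
    if metrics["p95_latency_ms"] > slos["p95_latency_ms"]:
        breaches.append("p95_latency_ms")
    # Faithfulness is a lower bound: breach when the observed score drops below it.
    if metrics["faithfulness"] < slos["faithfulness_min"]:
        breaches.append("faithfulness_min")
    return breaches


if __name__ == "__main__":
    window = {"p95_latency_ms": 2500, "faithfulness": 0.95}
    for name in slo_breaches(window):
        # In a real deployment this would open an incident, not just print.
        print(f"SLO breach: {name}")
```

The point of returning breach names instead of a boolean is that each name can route to its own runbook and owner.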