Which signal is most useful for debugging *why* an agent picked the wrong tool?

The state snapshot and routing decision recorded at the supervisor node.

The size of the prompt template.

Tracing agent runs — Semantic Web Academy

Overview

Deconstructing non-deterministic agent executions into discrete, auditable span trees.

Why it matters

Multi-agent graphs execute asynchronously and loop conditionally, so a flat, time-ordered log cannot show which branch led to which step. Structured tracing lets you inspect the state snapshot at a specific node, track token cost as it accumulates per step, isolate the latency of each edge traversal, and pinpoint where the routing logic went wrong.

How it actually works

Agents loop and branch non-deterministically, so flat logs collapse. Tracing records a span tree: per-node state, the routing decision taken, latency, and token cost — the data you need to answer 'why did it do that?'.

trace:
  run_id: run_123
  spans:
    - {node: supervisor, decision: graphQuery, state_keys: [question, risk], latency_ms: 42}
    - {node: graphQuery, cypher: 'MATCH path=(p)-[*1..2]-(x)', rows: 18, latency_ms: 131}
    - {node: critic, unsupported_claims: 1, routed_to: revise}
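
One way to produce a trace shaped like the sample above is to record each node execution as a span that can hold child spans. The sketch below is a minimal illustration, not a specific tracing library's API: the `Span` class, its `child` and `flatten` methods, and the choice to nest `graphQuery` under `supervisor` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One node execution in a run (hypothetical structure, not a standard)."""
    node: str
    attrs: dict[str, Any] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    def child(self, node: str, **attrs: Any) -> "Span":
        """Attach a child span and return it so further spans can nest under it."""
        span = Span(node=node, attrs=attrs)
        self.children.append(span)
        return span

    def flatten(self) -> list[dict[str, Any]]:
        """Depth-first list of spans, in the order they were recorded."""
        rows = [{"node": self.node, **self.attrs}]
        for c in self.children:
            rows.extend(c.flatten())
        return rows


# Rebuild the sample trace; nesting graphQuery under supervisor is an assumption.
root = Span("run", {"run_id": "run_123"})
supervisor = root.child("supervisor", decision="graphQuery",
                        state_keys=["question", "risk"], latency_ms=42)
supervisor.child("graphQuery", rows=18, latency_ms=131)
root.child("critic", unsupported_claims=1, routed_to="revise")

for row in root.flatten():
    print(row)
```

Keeping the parent/child links, rather than appending to a flat list, is what lets a later query ask which routing decision led to a given node.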

Trace the decision, not just the output. To debug why the agent picked the wrong tool, you need the supervisor's state snapshot and the branch it chose at that moment — the final answer tells you nothing about the fork that doomed it. Per-node decisions are the difference between 'it was wrong' and 'it routed to vectorSearch because confidence read 0.61'.
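
To make that concrete, here is a small sketch of pulling the routing decision and the state behind it out of recorded spans. It assumes spans are plain dictionaries with `node`, `decision`, and `state` keys; those key names and the `explain_routing` helper are hypothetical, and the 0.61 confidence value simply mirrors the example in the text.

```python
from typing import Any, Optional

# Illustrative spans; the key names are assumptions, not a fixed schema.
spans: list[dict[str, Any]] = [
    {"node": "supervisor", "decision": "vectorSearch",
     "state": {"confidence": 0.61}},
    {"node": "vectorSearch", "rows": 5},
]


def explain_routing(spans: list[dict[str, Any]],
                    node: str = "supervisor") -> Optional[str]:
    """Return a one-line explanation of the first routing decision made by `node`."""
    for span in spans:
        if span.get("node") == node and "decision" in span:
            state = ", ".join(f"{k}={v!r}"
                              for k, v in span.get("state", {}).items())
            return f"{node} routed to {span['decision']} with state: {state}"
    return None  # the node never recorded a decision


print(explain_routing(spans))
```

If the state snapshot was never recorded, this function has nothing to report, which is exactly the situation that output-only logging leaves you in.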

Attribute cost per span. Adding token_cost_usd per node turns tracing into a cost microscope: you find the one node responsible for most spend and optimise it, instead of guessing. Latency per span does the same for your p95.
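
As a rough sketch of that aggregation, the snippet below sums `token_cost_usd` per node and computes a nearest-rank p95 over span latencies. The span values are invented for illustration, and `cost_by_node` and `p95` are hypothetical helper names rather than part of any tracing tool.

```python
import math
from collections import defaultdict

# Made-up spans for illustration; only the field names follow the text.
spans = [
    {"node": "supervisor", "token_cost_usd": 0.002, "latency_ms": 42},
    {"node": "graphQuery", "token_cost_usd": 0.001, "latency_ms": 131},
    {"node": "critic", "token_cost_usd": 0.010, "latency_ms": 300},
    {"node": "supervisor", "token_cost_usd": 0.002, "latency_ms": 40},
]


def cost_by_node(spans):
    """Sum token cost per node; spans without a cost count as zero."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["node"]] += span.get("token_cost_usd", 0.0)
    return dict(totals)


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


totals = cost_by_node(spans)
top_node = max(totals, key=totals.get)  # node with the highest total spend
print(top_node, round(totals[top_node], 4))
print(p95([s["latency_ms"] for s in spans]))
```

The same grouping works per run or per day; the point is that the cost field has to exist on every span before any of this is possible.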

Stable run/thread ids stitch spans into one story and link a trace to the checkpoint that produced it — so an incident investigation can replay the exact run, not a reconstruction.
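
A minimal sketch of that stitching, assuming each span event carries a `run_id`, a per-run sequence number `seq`, and a `checkpoint_id`. The field names `seq` and `checkpoint_id`, the sample ids, and the `stitch` helper are assumptions for the example; your checkpointing layer will have its own identifiers.

```python
from collections import defaultdict

# Span events as they might arrive from concurrent runs, out of order.
events = [
    {"run_id": "run_123", "seq": 2, "node": "graphQuery", "checkpoint_id": "ckpt_b"},
    {"run_id": "run_456", "seq": 1, "node": "supervisor", "checkpoint_id": "ckpt_x"},
    {"run_id": "run_123", "seq": 1, "node": "supervisor", "checkpoint_id": "ckpt_a"},
]


def stitch(events):
    """Group events by run_id and order each run by its sequence number."""
    runs = defaultdict(list)
    for event in events:
        runs[event["run_id"]].append(event)
    for run_events in runs.values():
        run_events.sort(key=lambda e: e["seq"])
    return dict(runs)


runs = stitch(events)

# The checkpoint recorded on the first span is where a replay of run_123 would start.
replay_from = runs["run_123"][0]["checkpoint_id"]
print([e["node"] for e in runs["run_123"]], replay_from)
```

Without a stable `run_id` on every event, the interleaved spans from `run_123` and `run_456` could not be separated after the fact.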

Analogy

Agent tracing is a flight data recorder, not a single cockpit photo. After an incident you replay every control input and instrument reading in sequence (per-node decisions), which is how you learn the turn that caused the crash — the wreckage (final output) alone never tells you.

Pitfalls & how to avoid them

Logging only final output. Symptom: can't explain routing. Fix: per-node state + decision spans.
No cost per span. Symptom: blind cost optimisation. Fix: token_cost_usd per node.
Unlinked spans. Symptom: can't reconstruct a run. Fix: stable run/thread ids.
No link to checkpoints. Symptom: can't replay the exact run. Fix: tie each trace to the checkpoint that produced it.

Apply it to your system

Picture debugging a bad agent run tomorrow.

›Could you see which branch the supervisor took and the state behind it?
›Which node is your biggest cost or latency contributor right now?
›Can you replay a specific run from its trace and checkpoint?
