Diagnosing AI Agent Failures with AgentRx — Practical Guide for Teams

[Figure: decision graph of an agent execution trajectory showing per-step constraints, violations, and supporting evidence.]


By Agustin Giovagnoli / March 12, 2026

Reliable agentic systems demand repeatable, auditable methods for diagnosing AI agent failures—not ad-hoc log spelunking. AgentRx, a Microsoft Research effort, formalizes “what should happen” at each step in an agent’s trajectory and then checks reality against those expectations to pinpoint where and why things went wrong [1].

What is AgentRx? Overview of the framework and dataset

AgentRx is a systematic, domain-agnostic framework for analyzing agent execution trajectories. The team assembled a curated benchmark of 115 failed runs across three domains—structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is manually annotated with the exact step where the critical failure occurs and categorized using a grounded-theory, cross-domain taxonomy. The framework then automates much of the diagnostic work with constraint generation, auditable violation logs, and a language model–based judge to identify the most critical failure and assign a category [1].
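The shape of the benchmark can be pictured with a minimal record type. The field and category names below are illustrative stand-ins, not AgentRx's published schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    index: int           # position in the trajectory
    action: str          # e.g. a plan update, tool call, or final answer
    observation: str     # what the environment returned

@dataclass
class FailedTrajectory:
    domain: str                        # one of the three benchmark domains
    steps: list[Step] = field(default_factory=list)
    critical_failure_step: int = -1    # human-annotated step index
    failure_category: str = ""         # label from the cross-domain taxonomy

# Example: a run annotated as failing at step 2 with a malformed tool call
run = FailedTrajectory(
    domain="api_workflow",
    steps=[
        Step(0, "plan", "ok"),
        Step(1, "call:get_user", "ok"),
        Step(2, "call:update_user", "HTTP 400: missing field 'id'"),
    ],
    critical_failure_step=2,
    failure_category="malformed_tool_call",
)
```

Annotating the exact failing step, rather than labeling the whole run, is what makes localization accuracy measurable.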

Diagnosing AI agent failures with constraints and an LLM judge

AgentRx encodes expected behavior as constraints for each step in an agent’s trajectory. It evaluates those constraints step-by-step, producing a transparent log of violations and supporting evidence. A large language model–based judge consumes this evidence to localize the critical failure and explain its cause. In evaluations, this pipeline outperformed baseline approaches on both failure localization and cause attribution, offering a more reliable path to execution trajectory debugging in complex workflows [1].
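The step-wise check loop can be sketched as follows. Representing constraints as named predicates over a step record is an assumption for illustration; AgentRx generates its own constraint encoding:

```python
from typing import Callable

# A constraint: a human-readable name plus a predicate over one step's record.
Constraint = tuple[str, Callable[[dict], bool]]

def check_trajectory(steps: list[dict],
                     constraints: dict[int, list[Constraint]]) -> list[dict]:
    """Evaluate per-step constraints and return an auditable violation log."""
    violations = []
    for i, step in enumerate(steps):
        for name, predicate in constraints.get(i, []):
            if not predicate(step):
                violations.append({
                    "step": i,
                    "constraint": name,
                    "evidence": step,   # raw record kept for auditability
                })
    return violations

steps = [
    {"action": "call:search", "status": "ok"},
    {"action": "call:update", "status": "HTTP 400"},
]
constraints = {
    0: [("search_succeeds", lambda s: s["status"] == "ok")],
    1: [("update_succeeds", lambda s: s["status"] == "ok")],
}
log = check_trajectory(steps, constraints)
# A violation log like this (step 1, "update_succeeds") is the kind of
# evidence an LLM judge would consume to localize and explain the failure.
```

The key property is that every judgment traces back to a concrete, logged violation rather than to the judge's unexamined intuition.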

Benchmarks and results: why it matters

The benchmark spans three distinct domains, allowing consistent comparison of failure types and diagnostic performance. By tying each failure to an exact step and taxonomy category, teams can build incident patterns and response playbooks grounded in real data. The reported improvements over baselines in localizing and explaining failures highlight a practical path to diagnosing AI agent failures at scale—especially where multi-step reasoning, tool usage, and branching behavior complicate root-cause analysis [1].

Limitations and open questions

The initial dataset, while broad for a first release, covers 115 failed trajectories across three domains; teams should expect to extend or specialize constraints for their own environments. Operational considerations such as cost, latency, and the potential for constraint false positives also remain important when translating the research into production pipelines [1].

Integrating AgentRx ideas into production: observability and tooling

AgentRx’s approach aligns with emerging orchestration and observability stacks. The open-source Microsoft Agent Framework helps teams build and operate single- and multi-agent workflows in .NET and Python, aiming to make production systems more manageable and observable [2][3]. Industry guidance underscores the role of autonomous observability agents that monitor metrics, run root-cause analysis, and even remediate certain misconfigurations in real time—capabilities that complement constraint-driven diagnostics during live incidents [4]. Practical observability guides also emphasize end-to-end execution tracing: capturing decision graphs, latency, branching, and tool calls to drive faster triage and LLM-assisted summaries [5].
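End-to-end tracing of tool calls can start as small as a decorator that records arguments, results, and latency. This sketch assumes a plain in-process list as the trace sink rather than any particular observability backend:

```python
import functools
import time

TRACE: list[dict] = []   # stand-in for a real observability backend

def traced(tool_name: str):
    """Record each tool call's arguments, result (or error), and latency."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            record = {"tool": tool_name, "args": args, "kwargs": kwargs}
            try:
                record["result"] = fn(*args, **kwargs)
                return record["result"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                record["latency_s"] = time.perf_counter() - start
                TRACE.append(record)
        return inner
    return wrap

@traced("lookup_user")
def lookup_user(user_id: str) -> dict:
    # Hypothetical tool; in practice this would hit a real API.
    return {"id": user_id, "name": "Ada"}

lookup_user("u-42")
```

In production, the same records would flow to a tracing backend, but the shape of what to capture is the same as in the sketch.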

For broader context on Microsoft’s research directions, see the Microsoft Research homepage.

Operational playbook: from telemetry to action

A pragmatic workflow inspired by AgentRx focuses on traceability and evidence:

  • Capture the full execution trajectory: inputs, intermediate reasoning, tool calls, outputs, and timing [1][5].
  • Define per-step expectations and generate constraints that express correct behavior for each node in the graph [1].
  • Automate constraint checks and log any violations with supporting evidence for auditability [1].
  • Use an LLM-based judge to synthesize violation logs, localize the critical failure step, and produce a clear explanation for responders—accelerating diagnosis during on-call triage [1].
  • Feed results into runbooks and dashboards to track mean time to detect (MTTD), mean time to resolve (MTTR), and recurring failure categories over time [5].
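The final step in the playbook—tracking recurring categories and resolution times—can start from a flat incident log. The record fields here are illustrative:

```python
from collections import Counter
from statistics import mean

# Hypothetical incident log populated from post-triage records.
incidents = [
    {"category": "malformed_tool_call", "detect_min": 4,  "resolve_min": 22},
    {"category": "planner_error",       "detect_min": 11, "resolve_min": 65},
    {"category": "malformed_tool_call", "detect_min": 2,  "resolve_min": 18},
]

# Recurring failure categories, most frequent first.
top_categories = Counter(i["category"] for i in incidents).most_common()

# Mean time to detect and mean time to resolve, in minutes.
mttd = mean(i["detect_min"] for i in incidents)
mttr = mean(i["resolve_min"] for i in incidents)
```

Even a few weeks of such records reveal which taxonomy categories dominate and whether guardrails are actually shortening resolution times.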

Teams building multi-agent systems can layer these steps onto their orchestration stack, using visualization to spot branching dead-ends, slow tools, and ambiguous planner actions typical of execution trajectory debugging [2][3][5].

Common failure modes and how to spot them

Operational accounts catalog recurring failure types, including malformed tool calls, planner errors, and conflicts between agents. These failures often show up as characteristic telemetry patterns and can be mitigated with design changes, validation, or guardrails—reinforcing the value of constraint checks and auditable traces as a first line of defense [6].
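A first line of defense against malformed tool calls is validating arguments before dispatch. The schema format below (required keys plus type checks) is a simplified stand-in for a real JSON Schema validator:

```python
def validate_tool_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may be dispatched."""
    problems = []
    args = call.get("arguments", {})
    for name, expected_type in schema["required"].items():
        if name not in args:
            problems.append(f"missing required argument '{name}'")
        elif not isinstance(args[name], expected_type):
            problems.append(f"argument '{name}' should be {expected_type.__name__}")
    for name in args:
        if name not in schema["required"] and name not in schema.get("optional", {}):
            problems.append(f"unexpected argument '{name}'")
    return problems

# Hypothetical tool schema and a malformed call emitted by a planner.
schema = {"required": {"user_id": str}, "optional": {"verbose": bool}}
bad_call = {"tool": "get_user", "arguments": {"verbose": "yes"}}
issues = validate_tool_call(bad_call, schema)
```

Rejecting the call with a descriptive error gives the agent a chance to self-correct, and the same problem list doubles as telemetry for the failure-mode dashboards above.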

Conclusion and resources

AgentRx brings structure and scale to agent debugging: per-step constraints, violation logs, and an LLM judge to localize and explain issues—grounded in a benchmark of 115 failed trajectories across three domains [1]. For teams moving from prototypes to production, the Microsoft Agent Framework provides orchestration and operational footing for single- and multi-agent systems [2][3]. To go deeper, read the research and explore tooling:

  • The AgentRx research publication [1]
  • Microsoft Agent Framework on GitHub [2]
  • Microsoft’s framework announcement blog [3]


Sources

[1] AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
https://www.microsoft.com/en-us/research/publication/agentrx-diagnosing-ai-agent-failures-from-execution-trajectories/

[2] Welcome to Microsoft Agent Framework! – GitHub
https://github.com/microsoft/agent-framework

[3] Introducing Microsoft Agent Framework: The Open-Source Engine …
https://devblogs.microsoft.com/foundry/introducing-microsoft-agent-framework-the-open-source-engine-for-agentic-ai-apps/

[4] Autonomous Observability: AI Agents That Debug AI
https://www.computer.org/publications/tech-news/community-voices/autonomous-observability-ai-agents

[5] Mastering AI agent observability: A comprehensive guide – Medium
https://medium.com/online-inference/mastering-ai-agent-observability-a-comprehensive-guide-b142ed3604b1

[6] How to Debug AI Agents: 10 Failure Modes + Fixes | Galileo
https://galileo.ai/blog/debug-ai-agents
