When AI agents work in teams, failures happen. A lot. One agent passes along bad information, another makes a poor decision on top of it, and the whole task falls apart. The question is: who's to blame?
Right now, when a team of AI agents fails at a task, humans spend hours reading through conversation logs trying to figure out which agent caused the problem. It's expensive, time-consuming, and frankly, pretty boring work.
I came across a paper called "Which Agent Causes Task Failures and When?" that tried to solve this automatically. The authors built a dataset of 184 real agent team failures and tested three methods to identify the culprit. Their best approach hit 53.5% accuracy - better than guessing, but not by much.
Reading this paper, I had a simple thought: what if agent conversations are just networks? What if we could use graph theory to find patterns that reveal who caused the failure?
the graph theory approach
Here's the core insight: when agents talk to each other, they create a network. Some agents are central to the conversation. Others are on the periphery. Some agents influence many others. Some just respond to what they're told.
In failure scenarios, I suspected that certain network positions would be more likely to indicate the culprit. Maybe the most active agents cause more problems. Maybe agents who influence many others are more likely to spread bad information.
So I downloaded the Who&When dataset and started building graphs.
methodology: building networks from conversations
Each agent conversation became a directed graph (a minimal construction sketch follows the list):
Nodes = agents
Edges = who responds to whom
Edge weights = frequency of interaction
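To make this concrete, here's a minimal sketch of that construction in NetworkX. The message format (an ordered list of dicts with an "agent" field) is an assumption, as is the simplification that each message responds to the previous speaker; the real Who&When logs need their own parsing step.

```python
import networkx as nx

def build_interaction_graph(messages):
    """Build a directed interaction graph from an ordered conversation.

    Assumes `messages` looks like [{"agent": "Coder", "content": "..."}, ...]
    and that each message responds to the previous speaker. An edge u -> v
    means u responded to v, weighted by how often that happened.
    """
    G = nx.DiGraph()
    for prev_msg, msg in zip(messages, messages[1:]):
        responder, addressee = msg["agent"], prev_msg["agent"]
        if G.has_edge(responder, addressee):
            G[responder][addressee]["weight"] += 1
        else:
            G.add_edge(responder, addressee, weight=1)
    return G
```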
I then applied standard graph algorithms, sketched in code after the list:
PageRank: Measures influence and centrality
Betweenness Centrality: Identifies agents who control information flow
Community Detection: Finds clusters of agents that work together
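Under the same assumptions, computing those three signals is only a few lines of NetworkX. Two judgment calls in this sketch: I use the unweighted betweenness variant (NetworkX treats edge weights as distances, which would invert their meaning here), and I convert to an undirected graph for modularity-based community detection.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def graph_signals(G):
    """Compute the three per-agent signals used to rank suspects."""
    pagerank = nx.pagerank(G, weight="weight")       # influence / centrality
    betweenness = nx.betweenness_centrality(G)       # who brokers information flow
    communities = greedy_modularity_communities(G.to_undirected(), weight="weight")
    return pagerank, betweenness, communities
```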
the surprising discovery
After building the initial graphs, I hit my first major bug. My code was creating self-loops - every agent was only "talking to itself" instead of connecting to other agents. The graph wasn't capturing agent-to-agent interactions at all.
But here's the interesting part: even with this "bug," the approach was working. It was essentially measuring "who talks most" rather than "who influences whom." And agents who talk more were indeed more likely to be the culprits.
This led to a key insight: activity-based failure detection works. Agents who are more verbose, who participate more heavily in conversations, are statistically more likely to cause failures.
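Here's what that signal looks like on its own, stripped of any graph machinery - effectively what the self-loop version of the graph was measuring. The message format is the same assumed one as above.

```python
from collections import Counter

def activity_scores(messages):
    """Score agents purely by how much of the conversation they produce."""
    counts = Counter(msg["agent"] for msg in messages)
    total = sum(counts.values())
    return {agent: n / total for agent, n in counts.items()}

# scores = activity_scores(messages)
# predicted_culprit = max(scores, key=scores.get)
```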
systematic optimization
I tested different combinations systematically (a scoring sketch follows the list):
Ensemble weights: 70% PageRank + 20% Betweenness + 10% Community detection gave the best results.
Temporal decay: Weighting recent messages more heavily improved accuracy slightly.
Graph construction: I compared real agent-to-agent networks against the activity-based self-loop version; the activity-based graphs consistently performed better.
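A sketch of how the winning combination can be assembled. The post doesn't specify how community detection becomes a per-agent number or what decay rate was used, so the `community_score` input, the `last_turn` bookkeeping, and the half-life below are assumptions for illustration; only the 70/20/10 weights come from the result above.

```python
def ensemble_scores(pagerank, betweenness, community_score, last_turn, num_turns,
                    weights=(0.7, 0.2, 0.1), half_life=10):
    """Combine the three graph signals per agent, then apply an exponential
    decay that favors agents who were active late in the conversation.

    `community_score` and `last_turn` are assumed dicts keyed by agent name.
    """
    w_pr, w_bt, w_cm = weights
    scores = {}
    for agent in pagerank:
        base = (w_pr * pagerank[agent]
                + w_bt * betweenness.get(agent, 0.0)
                + w_cm * community_score.get(agent, 0.0))
        turns_since_active = num_turns - last_turn.get(agent, 0)
        scores[agent] = base * 0.5 ** (turns_since_active / half_life)
    return scores
```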
After testing multiple variations, the optimal approach achieved 50.8% accuracy (64 correct out of 126 valid cases).
results analysis
50.8% doesn't sound impressive until you consider the context:
Random guessing: 25% accuracy
Academic baseline: 53.5% accuracy
Our gap to the baseline: only 2.7 percentage points
More importantly, when I analyzed the failures, the actual culprits were consistently in our top 3 predictions. We weren't missing them - we were just ranking them slightly wrong.
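The top-3 check is easy to make precise. A minimal sketch, assuming predictions are stored as ranked lists per case and the culprit labels come from the dataset annotations:

```python
def top_k_accuracy(ranked_predictions, ground_truth, k=3):
    """ranked_predictions: dict case_id -> list of agents, best guess first.
    ground_truth: dict case_id -> annotated culprit agent."""
    hits = sum(
        1 for case_id, culprit in ground_truth.items()
        if culprit in ranked_predictions.get(case_id, [])[:k]
    )
    return hits / len(ground_truth)
```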
what this means
This work establishes that simple activity patterns carry significant signal for failure attribution. You don't need complex semantic analysis or expensive LLM-based approaches: basic graph metrics over conversation patterns identify failure-causing agents well above chance, with a simple, fast algorithm.
Once refined further, this approach could replace the current trend of using "LLM as a judge" for agent evaluation - delivering faster, more consistent, and more explainable results.
the ceiling problem
Every enhancement I tried made results worse. Adding content analysis, position-based weighting, complex heuristics - all decreased accuracy. This suggests we've hit the fundamental ceiling for observational methods.
The remaining cases likely require causal validation: actually testing whether muting a suspected agent prevents the failure. That's a different problem requiring different tools.
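For completeness, here's what that causal test could look like in outline. Everything here is a placeholder: `replay_fn` and `task_failed` stand in for a replay harness and a failure criterion that don't exist yet, and they are the genuinely hard parts.

```python
def muting_prevents_failure(conversation, suspect, replay_fn, task_failed):
    """Counterfactual check: re-run the conversation with the suspect's
    messages removed and see whether the task still fails.

    `replay_fn` and `task_failed` are hypothetical hooks, not real APIs.
    """
    muted = [msg for msg in conversation if msg["agent"] != suspect]
    outcome = replay_fn(muted)
    return not task_failed(outcome)
```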
code and reproducibility
The complete implementation is available on GitHub. The core algorithm is surprisingly simple - about 50 lines of Python using NetworkX. Sometimes the best solutions are the obvious ones.
This research establishes a baseline for graph-based agent failure attribution and identifies the limits of activity-based detection methods.
moving forward
I'm now working on implementing the counterfactual replay system. The goal is to prove causal relationships and push accuracy above the academic baseline. But beyond just beating benchmarks, I want to understand why multi-agent systems fail and how to design better ones.
