Multi-Agentic Systems
Intro
A multi-agentic system coordinates two or more LLM agents — each with its own context window, tools, and instructions — to solve a task that a single agent handles poorly. The Agents page covers what agents are, the augmented LLM building block, and autonomous agent design. This page covers agentic workflow patterns — from simple prompt chaining to multi-agent orchestration — along with communication protocols, coordination structures, and failure modes.
Multi-agent typically uses 3–10× more tokens than single-agent for equivalent tasks, driven by context duplication and coordination messages. That cost is justified under three specific conditions:
- Context pollution — a subtask generates over 1,000 tokens of irrelevant context that degrades the main agent's reasoning quality.
- Parallelization — independent work paths can run concurrently, and sequential execution is unacceptably slow.
- Specialization — the agent has 20+ tools and selection accuracy drops, or the task requires conflicting behavioral modes (empathetic support vs. precise code review in the same session).
If none of these apply, a single well-prompted agent with good tools outperforms multi-agent on cost, latency, and debuggability. Anthropic reports that teams have invested months building multi-agent architectures only to discover that improved prompting on a single agent achieved equivalent results.
The key design principle is context-centric decomposition: split agents along context boundaries, not problem boundaries. An agent handling a feature should also handle its tests — it already has the context. Only introduce a new agent when one genuinely cannot hold the relevant context in its window. Problem-centric splits (one agent writes code, another writes tests, a third reviews) force constant coordination and lose information at each handoff — a "telephone game" where fidelity drops with every transfer.
Communication Patterns
Agents must share context to coordinate. Three mechanisms dominate production systems, each with a different fidelity-cost tradeoff.
Full history passthrough. The receiving agent gets the entire prior conversation. OpenAI Agents SDK does this by default on handoff. Simple to implement, but context grows without bound — after 10+ handoffs the receiving agent's window fills with irrelevant history, and reasoning quality degrades from "lost in the middle" effects.
Scoped context (filtered handoff). The orchestrator decides what each downstream agent needs and passes only that subset. Anthropic's Research system uses this: subagents write outputs to a filesystem store and pass lightweight references back to the coordinator, preventing information loss while keeping context compact. OpenAI's SDK provides input_filter callbacks and built-in filters like remove_all_tools (strips tool call history from handoff context). This is the production-recommended approach.
Shared external state (blackboard). A central store — vector database, Redis, filesystem — holds system state. Agents read and write independently without direct messaging. The blackboard pattern works best for non-linear problems where the step sequence is unknown upfront. Agents don't know about each other, only the shared state. The tradeoff: race conditions on concurrent writes and no built-in ordering guarantees.
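As a concrete sketch, a blackboard can be as small as a locked dictionary. The `Blackboard` class and the two posting "agents" below are hypothetical illustrations, with a lock guarding concurrent writes; a production system would back this with Redis, a vector database, or the filesystem:

```python
import threading

class Blackboard:
    """Shared state store: agents coordinate through reads and writes, never directly."""
    def __init__(self):
        self._state = {}
        self._lock = threading.Lock()  # guards against racy concurrent writes

    def post(self, key, value, author):
        with self._lock:
            self._state[key] = {"value": value, "author": author}

    def read(self, key):
        with self._lock:
            entry = self._state.get(key)
            return entry["value"] if entry else None

# Two "agents" that only know the blackboard, not each other
board = Blackboard()
board.post("research_notes", "LLM agents need context isolation", author="researcher")
notes = board.read("research_notes")
board.post("draft", f"Summary: {notes}", author="writer")
```

Note that the lock only prevents torn writes; it does not impose an ordering, which is exactly the tradeoff described above.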
Workflow Patterns
Five patterns cover the spectrum from simple single-LLM orchestration to multi-agent coordination. They form a progression of increasing complexity — start with the simplest pattern that solves the problem.
Prompt Chaining
flowchart LR
In[Input] --> S1[Step 1 LLM] --> G1{Gate} --> S2[Step 2 LLM] --> G2{Gate} --> Out[Output]
Break a task into sequential steps where each LLM call processes the output of the previous one. Add programmatic checks (gates) between steps to verify the process stays on track.
When to use: tasks that decompose cleanly into fixed subtasks. Example: generate marketing copy then translate it, or write an outline, validate it meets criteria, then write the document.
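A minimal sketch of the chain with a programmatic gate between steps; `call_llm`, `outline_gate`, and `chained_pipeline` are illustrative names, and the model call is a deterministic stub:

```python
def call_llm(prompt):
    # Stand-in for a real model call; deterministic for illustration
    return f"[output for: {prompt}]"

def outline_gate(outline):
    # Programmatic check between steps: reject empty or too-short outlines
    return len(outline) > 10

def chained_pipeline(topic):
    outline = call_llm(f"Write an outline about {topic}")
    if not outline_gate(outline):
        # Failing the gate stops the chain instead of propagating a bad outline
        raise ValueError("gate failed: outline rejected")
    return call_llm(f"Expand this outline into a document: {outline}")

doc = chained_pipeline("agent design")
```

The gate is ordinary code, not an LLM call — that is what makes it cheap and reliable as a checkpoint.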
Routing
flowchart TD
In[Input] --> R[Router LLM]
R --> P1[Prompt or Model A]
R --> P2[Prompt or Model B]
R --> P3[Prompt or Model C]
Classify the input and direct it to a specialized prompt or model. This lets you optimize each downstream path independently — a change to handle refund requests will not degrade general question answering.
When to use: distinct input categories that need different handling. Example: route customer queries to a small fast model for general questions, a larger model for complex technical issues, a constrained workflow for refund requests.
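A routing sketch under the same assumptions — the classifier is stubbed with keyword matching where a real system would use an LLM classification call, and all handler names are illustrative:

```python
def classify(query):
    # In production this is an LLM classification call; keyword stub here
    if "refund" in query.lower():
        return "refund"
    if "error" in query.lower() or "crash" in query.lower():
        return "technical"
    return "general"

# Each category gets its own independently tunable prompt or model
HANDLERS = {
    "general":   lambda q: f"small-model answer: {q}",
    "technical": lambda q: f"large-model answer: {q}",
    "refund":    lambda q: f"constrained refund workflow: {q}",
}

def route(query):
    return HANDLERS[classify(query)](query)
```

Because each handler is isolated, tuning the refund prompt cannot regress the general path.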
Parallelization
flowchart TD
In[Input] --> A[LLM Call A] & B[LLM Call B] & C[LLM Call C]
A --> Agg[Aggregator]
B --> Agg
C --> Agg
Agg --> Out[Output]
Run multiple LLMs simultaneously and aggregate results. Two variants: sectioning splits independent subtasks across parallel agents; voting runs the same task through multiple agents for higher confidence.
When to use: independent subtasks that benefit from speed, or tasks where multiple perspectives improve reliability — running guardrails in parallel with the main response, multi-aspect code review, content moderation with vote thresholds.
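The sectioning variant can be sketched with a thread pool; `review_aspect` is a stubbed stand-in for a per-aspect LLM call, and the join at the end plays the aggregator role:

```python
from concurrent.futures import ThreadPoolExecutor

def review_aspect(aspect, code):
    # Stub for one parallel LLM call reviewing a single independent aspect
    return f"{aspect}: ok"

def parallel_review(code, aspects=("security", "style", "performance")):
    # Sectioning: each aspect is an independent subtask, run concurrently
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda a: review_aspect(a, code), aspects))
    # Aggregator: combine the independent verdicts into one report
    return "\n".join(results)

report = parallel_review("def f(): pass")
```

The voting variant is the same shape with identical prompts per call and a threshold applied in the aggregator instead of a join.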
Orchestrator-Workers
flowchart TD
In[Input] --> O[Orchestrator LLM]
O --> W1[Worker 1] & W2[Worker 2] & W3[Worker 3]
W1 --> S[Synthesize]
W2 --> S
W3 --> S
S --> Out[Output]
A central LLM dynamically decomposes the task, delegates subtasks to worker LLMs, and synthesizes results. The subtasks are not predefined — the orchestrator determines them based on the input. Topologically similar to parallelization, but the key difference is flexibility: workers and their tasks are determined at runtime. Anthropic's Research system uses Claude Opus 4 as lead with Sonnet 4 subagents (3 to 5 spawned in parallel), and reports that the multi-agent system outperformed a single-agent baseline by 90.2% on an internal research eval. This is the dominant production pattern for complex coding and research tasks.
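The runtime decomposition is the defining feature, sketched below with stubbed `plan`, `worker`, and `synthesize` steps (all hypothetical names; in a real system `plan` is the lead model's call and each `worker` is a spawned subagent):

```python
def plan(task):
    # Runtime decomposition: the orchestrator decides the subtasks from the input;
    # their number and content are NOT fixed in advance
    return [f"{task} (part {i})" for i in range(1, 4)]

def worker(subtask):
    # Each worker is a subagent with its own fresh context window
    return f"findings for {subtask}"

def synthesize(results):
    # Lead agent merges worker outputs into a single answer
    return " | ".join(results)

def orchestrate(task):
    subtasks = plan(task)                      # determined at runtime
    results = [worker(s) for s in subtasks]    # in production these run in parallel
    return synthesize(results)

answer = orchestrate("survey agent frameworks")
```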
Evaluator-Optimizer
flowchart TD
In[Input] --> G[Generator LLM]
G --> D[Draft]
D --> E[Evaluator LLM]
E -->|Revise| G
E -->|Accepted| Out[Final Output]
One LLM generates a response; another evaluates it against criteria and provides feedback. The loop continues until the evaluator approves or an iteration cap is hit. Two indicators of good fit: LLM responses demonstrably improve when given human-like feedback, and the LLM can provide such feedback. When to use: tasks with clear evaluation criteria — literary translation with nuance, complex search requiring multiple rounds, code review, compliance checking.
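The loop with its iteration cap can be sketched as follows; `generate` and `evaluate` are deterministic stubs standing in for the two LLM calls (the stub evaluator approves after two rounds of revision):

```python
def generate(task, feedback):
    # Generator stub: the version number stands in for quality improving with feedback
    return {"text": f"draft of {task}", "version": len(feedback)}

def evaluate(draft):
    # Evaluator stub: approves once the draft has been revised twice
    if draft["version"] >= 2:
        return None  # accepted
    return "tighten the argument"  # feedback triggers another round

def refine(task, max_iters=5):
    feedback = []
    for _ in range(max_iters):  # iteration cap prevents infinite refinement
        draft = generate(task, feedback)
        note = evaluate(draft)
        if note is None:
            return draft
        feedback.append(note)
    return draft  # cap hit: return best effort rather than loop forever

final = refine("translate poem")
```

The `max_iters` cap and the best-effort fallback are the parts worth copying verbatim; maker-checker loops without them refine indefinitely (see Pitfalls).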
Multi-Agent Coordination
Beyond workflow patterns, multi-agent systems use three structural patterns for organizing agent interactions.
Handoff / triage. One active agent at a time. The current agent decides dynamically when to transfer control to a specialist. In Microsoft Agent Framework, AgentWorkflowBuilder declares a handoff routing graph where each agent receives transfer targets as tool definitions:
using Microsoft.Agents.AI;
using Microsoft.Agents.AI.Workflows;
using Microsoft.Extensions.AI;

// Create specialized agents from an IChatClient
ChatClientAgent triageAgent = new(chatClient,
    "Route customer issues to the appropriate specialist.",
    "triage_agent",
    "Routes to the right specialist");
ChatClientAgent statusAgent = new(chatClient,
    "Check order status. Transfer back to triage if not status-related.",
    "order_status_agent",
    "Handles order status queries");
ChatClientAgent refundAgent = new(chatClient,
    "Process refund requests. Transfer back to triage if not refund-related.",
    "refund_agent",
    "Handles refund requests");

// Declare the handoff routing graph
Workflow workflow = AgentWorkflowBuilder
    .CreateHandoffBuilderWith(triageAgent)
    .WithHandoffs(triageAgent, [statusAgent, refundAgent])
    .WithHandoffs([statusAgent, refundAgent], triageAgent)
    .Build();

// Execute the workflow
List<ChatMessage> messages =
    [new(ChatRole.User, "I need a refund for order 321 — item was damaged")];
Run result = await InProcessExecution.RunAsync(workflow, messages);
foreach (WorkflowEvent evt in result.NewEvents)
{
    if (evt is WorkflowOutputEvent output)
        Console.WriteLine($"Result: {output.Data}");
}
When an agent calls a handoff tool, control transfers with the conversation history. The routing decision is LLM-driven — each agent decides when to transfer based on its instructions, not a rule engine. Specialists can transfer back to triage if the issue is outside their scope.
Group chat / debate. Multiple agents in a shared conversation thread with a chat manager controlling turn order. When to use: consensus-building, brainstorming, compliance review. Keep to 3 or fewer agents — beyond that, coordination cost dominates.
Swarm (peer-to-peer). Agents communicate directly without central control. Each agent independently decides when and where to transfer. Rarely used in production — the lack of central coordination makes debugging and error recovery significantly harder. Most teams eventually add a supervisor.
Pitfalls
Context Loss at Handoffs
Information clear to Agent A gets compressed, omitted, or distorted when passed to Agent B. Sequential chains are worst — earlier messages get compressed at each hop, eroding fidelity progressively.
Why it happens: the "Goldilocks dilemma" — pass full context and instruction density drops (agent loses focus); summarize and edge cases vanish. Natural language handoffs lack schema enforcement, so semantic errors pass silently without raising runtime exceptions.
Mitigation: use the filesystem artifact pattern — agents write structured outputs to external storage and pass lightweight references. Define explicit output schemas for inter-agent communication. Validate agent output before passing to the next agent — reject low-confidence or malformed responses.
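The filesystem artifact pattern reduces to a store that returns references instead of content. `ArtifactStore` below is a hypothetical minimal version; the point is that only the short path string crosses the agent boundary, not the full payload:

```python
import json
import pathlib
import tempfile

class ArtifactStore:
    """Agents write full outputs here and pass back only lightweight references."""
    def __init__(self, root):
        self.root = pathlib.Path(root)

    def put(self, name, payload):
        path = self.root / f"{name}.json"
        path.write_text(json.dumps(path_safe(payload)))
        return str(path)  # the reference handed back to the coordinator

    def get(self, ref):
        return json.loads(pathlib.Path(ref).read_text())

def path_safe(payload):
    # Placeholder for schema validation before persisting (see mitigation above)
    return payload

store = ArtifactStore(tempfile.mkdtemp())
# Subagent writes a large structured result, returns only the reference
ref = store.put("subagent_1", {"findings": ["fact A", "fact B"], "confidence": 0.9})
# Coordinator dereferences only when it actually needs the content
payload = store.get(ref)
```

Because the reference is tiny and the payload is structured JSON, nothing is lossily summarized at the handoff.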
Coordination Cost Explosion
Interaction complexity scales as n(n−1)/2: 2 agents = 1 interaction, 4 = 6, 10 = 45. A task costing $0.10 for a single agent may cost $1.50 for multi-agent after coordination overhead and context duplication. Multi-agent systems use roughly 15× more tokens than equivalent chat interactions.
Why it happens: every handoff duplicates context. Coordination messages consume tokens without advancing the task. A known anti-pattern is the "politeness loop" — two agents enter a cycle of thanking each other, burning tokens without advancing the task. Free-form conversation between agents has no built-in termination guarantee, and without max_turns caps these loops can run for hours before detection.
Mitigation: use structured output types between agents instead of free-form conversation. Set max_turns on every agent. Monitor per-run token usage and alert on outliers. Add agents only when you can demonstrate measurable improvement over fewer.
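A hard turn cap is a one-line guard. The sketch below (hypothetical `converse` helper, stub agents) shows how the cap bounds a politeness loop that free-form conversation would otherwise never terminate, and how a structured sentinel beats relying on the chat to wind itself down:

```python
def converse(agent_a, agent_b, opening, max_turns=6):
    """Alternate two agents, with a hard cap so no loop can run unbounded."""
    message, log = opening, []
    agents = [agent_a, agent_b]
    for turn in range(max_turns):
        message = agents[turn % 2](message)
        log.append(message)
        if message == "DONE":  # structured termination signal, not free-form chat
            break
    return log

# Two stub agents stuck thanking each other: the cap bounds the damage
polite = lambda m: "thank you!"
log = converse(polite, polite, "hello")
```

Without `max_turns`, this pair of agents would exchange messages (and burn tokens) until some external limit intervened.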
Deadlocks and Infinite Loops
Circular dependencies — A waits on B, B waits on C, C waits on A — hang silently, burning budgets without crashing. Maker-checker loops without iteration caps refine indefinitely.
Why it happens: natural language coordination has no built-in timeout or deadlock detection. Unlike distributed systems with formal protocols, there is no heartbeat or lease mechanism by default.
Mitigation: lease-lock patterns with TTL on agent-to-agent waiting. Single orchestrator owning state transitions. Explicit iteration caps on every loop with fallback behavior — escalate to human or return best result with a quality warning. Circuit breaker patterns for agent dependencies.
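A TTL lease converts silent waiting into an explicit timeout with a fallback path. `Lease` and `wait_for` below are illustrative, not from any framework; the TTL and poll interval are arbitrary for the demo:

```python
import time

class Lease:
    """TTL lease: a wait on another agent expires instead of hanging forever."""
    def __init__(self, ttl_seconds):
        self.expires_at = time.monotonic() + ttl_seconds

    def expired(self):
        return time.monotonic() >= self.expires_at

def wait_for(result_ready, ttl_seconds, poll=0.01):
    lease = Lease(ttl_seconds)
    while not lease.expired():
        if result_ready():
            return "ok"
        time.sleep(poll)
    return "timeout"  # fallback path: escalate to a human or return best effort

# An agent that never responds: the lease converts a deadlock into a timeout
status = wait_for(lambda: False, ttl_seconds=0.05)
```

A circuit breaker is the same idea one level up: after N consecutive timeouts against the same agent, stop calling it at all and take the fallback immediately.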
Cascading Errors
An error in one agent propagates through the system, amplified at each step. A hallucinated fact from Agent A becomes trusted input for Agent B, which builds further conclusions on it. If those conclusions reach persistent memory, they contaminate future runs.
Why it happens: semantic opacity — natural language errors pass as "valid" data. Agents trust upstream output by default, and there is no schema validation for factual correctness. Parallelization amplifies the problem — one faulty planning step spawns dozens of workers propagating the same error.
Mitigation: validate outputs at each agent boundary before passing downstream. Use independent verification agents for high-stakes decisions. Enforce guardrails at the infrastructure layer (network egress, filesystem permissions, execution budgets) rather than the prompt layer — agents can reason around app-level restrictions but cannot bypass environment-level enforcement.
Tradeoffs
| Factor | Single Agent | Multi-Agent |
|---|---|---|
| Token cost | 1× baseline | 3–10× overhead |
| Latency | Sequential tool calls | Parallelizable, but coordination adds overhead |
| Debuggability | Single linear trace | Multiple interleaving traces |
| Context window | Limited by one window | Each agent gets a fresh window |
| Tool management | All tools loaded (degrades at 20+) | Specialized toolsets per agent |
| Failure surface | Agent-level only | Agent + coordination failures |
The "bitter lesson" of multi-agent: elaborate coordination architectures built to work around current model limitations risk obsolescence. A 10-agent system may be outperformed by a single next-generation model with a larger context window. Build multi-agent only when the coordination cost is justified by measurable improvement today — not as speculative architecture for tomorrow's problems.
Questions
When is a multi-agent system justified over a single agent?
- Justified under three conditions: context pollution (subtask degrades main agent reasoning), parallelization (independent paths need concurrent execution), specialization (20+ tools degrade selection accuracy, or conflicting behavioral modes needed)
- If none apply, a single agent wins: 3–10× fewer tokens, lower latency, single linear trace for debugging
- Many teams investing months in multi-agent discover equivalent results from better prompting on one agent
- Key tradeoff: multi-agent buys context isolation and parallelism at the cost of coordination overhead and debugging complexity
Why split agents along context boundaries rather than problem boundaries?
- Problem-centric (code agent + test agent + review agent) forces constant coordination — each agent needs context from the others, creating lossy handoffs
- Context-centric splits along natural context boundaries — an agent handling a feature also handles its tests because it already has the context
- Introduce a new agent only when context genuinely cannot fit in one window
- Reduces handoff count, cuts token overhead, prevents compounding information loss at each transfer
- Key tradeoff: context-centric may produce broader agents (more tools per agent), but avoids the "telephone game" of multi-hop handoffs
Why are multi-agent systems harder to debug than single agents?
- Semantic opacity: natural language errors pass as "valid" data between agents — no schema violations, no exceptions raised. A hallucinated fact from Agent A becomes trusted input for Agent B
- Non-linear traces: multiple interleaving reasoning chains with handoffs instead of one sequential trace, making root cause analysis harder
- Emergent behavior: agent interactions produce outcomes no single agent's instructions predict
- Known anti-pattern: two agents entering a politeness loop, each thanking the other, consuming budget without task progress — correct behavior per agent, catastrophic in combination
- Key tradeoff: multi-agent gains specialization but loses the single-trace debuggability that makes single-agent failures straightforward to fix
References
- Multi-Agent Research System — Engineering (Anthropic)
- Building Effective Agents (Anthropic Engineering)
- OpenAI Agents SDK — Handoffs
- AI Agent Design Patterns — Orchestration (Microsoft)
- Microsoft Agent Framework — Workflows Documentation (Microsoft Learn)
- Why Multi-Agent Systems Fail (Galileo)
- OWASP Top 10 for LLM Applications — Agentic Security (OWASP)
- MAS-FIRE: A Fault Injection Framework for Multi-Agent Systems (arXiv)