12 Multi-Agent Orchestration
flowchart TD
W["Writer agent"] -->|"Code changes"| R["Reviewer agent"]
R -->|"Findings: bugs,<br/>logic errors, security"| T["Tester agent"]
T -->|"Test updates +<br/>verification"| Done["Verified output"]
A single AI agent can modify a file, and even several files in sequence if the changes are simple enough and the context window holds. But the moment a cross-cutting change spans 40 files across 5 concerns, a single agent becomes the bottleneck — not because it lacks intelligence, but because it lacks bandwidth. Multi-agent orchestration solves the coordination problem by decomposing work across specialized agents that each operate within manageable context budgets. But coordination is not free. Agents that work in parallel can conflict, produce inconsistent output, or break each other’s work. The discipline of multi-agent orchestration is the discipline of getting the benefits of parallelism without paying the costs of chaos.¹
This chapter covers when to use multiple agents, how to specialize them, how to run them in parallel safely, how to resolve conflicts when they arise, and what the human’s role is in all of it.
12.1 When One Agent Is Enough
Not every task requires multiple agents. The overhead of orchestration (partitioning work, managing sessions, resolving conflicts, validating independently) is real. If the task fits comfortably in a single agent’s context, the single agent is the better choice.
A single agent is sufficient when:
- Scope is narrow. The change touches fewer than 10 files in a single module.
- Concern is singular. One type of change (fix logging, update types, add tests), not three interleaved concerns.
- Dependencies are linear. Each file change follows naturally from the previous one, with no need for parallel work.
- Context budget is adequate. The agent can hold all relevant source files, instructions, and conversation history without exceeding roughly 60% of its window capacity, leaving room for reasoning.
A single agent breaks down when:
- Multiple concerns intersect. The change requires architectural knowledge and domain expertise and security awareness. One agent cannot hold all three specialization contexts simultaneously without dilution.
- File count exceeds context capacity. More than 15-20 files means the agent cannot see all the code it needs to modify.
- Parallelism would reduce wall-clock time significantly. Five independent file groups that could be modified simultaneously instead take five times as long sequentially.
The decision matrix:
| Dimension | Single agent | Multiple agents |
|---|---|---|
| Files changed | < 10 | > 15 |
| Concerns | 1 | 2+ |
| File dependencies | Linear | Graph (can parallelize) |
| Required expertise | One domain | Multiple domains |
| Time pressure | Low | Moderate to high |
| Risk of context overload | Low | High |
The boundary at 10-15 files is approximate and experience-derived, though not precisely measured. It reflects the practical limit where a single agent’s conversation history — accumulated tool calls, file reads, edit confirmations, test output — begins consuming enough context to crowd out the instructions and source code that the agent needs to do its work well. Your mileage will vary by model, task complexity, and instruction file size.
When the decision is marginal, err toward a single agent. Coordination costs are real. Multi-agent orchestration is a tool for tasks that exceed single-agent capacity, not a default mode of operation.
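Reduced to code, the matrix reads as a small triage heuristic. This is an illustrative sketch, not a feature of any tool: the thresholds are the chapter's approximate, experience-derived numbers, and `should_orchestrate` is a hypothetical helper name.

```python
def should_orchestrate(files_changed: int, concerns: int) -> bool:
    """Rough triage from the decision matrix; thresholds are approximate."""
    if files_changed > 15 and concerns >= 2:
        return True   # clearly beyond single-agent capacity
    if files_changed < 10 and concerns == 1:
        return False  # comfortably single-agent territory
    # Marginal zone (10-15 files, or a single broad concern):
    # coordination costs are real, so err toward a single agent.
    return False
```

A 75-file, five-concern change triages to multiple agents; an 8-file single-concern fix does not.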
12.2 Agent Specialization Patterns
The key insight behind multi-agent orchestration is that specialization produces better results than generalization — for the same reason that a team of specialists outperforms a team of generalists on complex projects. A security expert and a logging expert, each working within their domain, produce more reliable output than a single agent told to “handle security and logging.”
Specialization works because it reduces the context each agent needs to carry. An architecture agent loaded with type definitions, module boundaries, and architectural patterns does not also need logging conventions, output formatting rules, and symbol dictionaries. The context it receives is concentrated, not diluted.
Three specialization patterns recur across most multi-agent workflows.
12.2.1 Pattern 1: Writer / Reviewer / Tester
The most common pattern separates code production from code validation. One agent writes the code. A second agent reviews it. A third writes or updates tests.
This pattern maps directly to the human workflow of author, reviewer, and QA — and for the same reason. The writer optimizes for correctness and completeness. The reviewer optimizes for catching what the writer missed. The tester optimizes for verifiable behavior. These are different cognitive tasks that benefit from different contexts.
In practice, the reviewer agent receives the diff plus the original source, not the writer’s full conversation history. This is deliberate. The reviewer should evaluate the output on its own merits, not be anchored by the writer’s reasoning. If the writer had a good reason for a decision but the code doesn’t reflect it, that is a signal, not an excuse.
12.2.2 Pattern 2: Domain Teams
For cross-cutting changes, organize agents by area of expertise rather than by workflow stage. Each team owns a concern and is responsible for all files related to that concern.
| Aspect | Architecture Team | Domain Expert Team |
|---|---|---|
| Context loaded | Type definitions, Module boundaries, Pattern catalog, Dependency graph | Output conventions, Symbol dictionaries, UX guidelines, Migration patterns |
| Owns | Type safety fixes, Dead code removal, API consolidation | Verbose coverage, Logger migration, Formatting cleanup |
In the auth-logging overhaul (PR #394, detailed in the case study and summarized in Chapter 13), this two-team structure (architecture team led by an architect agent, domain team led by a logging expert agent) handled a 75-file change across five concerns. The architecture team carried type definitions and architectural patterns. The domain team carried output conventions and migration examples. Neither team needed the other’s context, and both produced output consistent with their specialization.
This pattern scales by adding teams. A security concern adds a security team. A documentation concern adds a documentation team. Each team brings its own specialized context, its own instruction files, and its own validation criteria. The coordination cost is between teams, not within them.
12.2.2.1 Concrete Dispatch: What It Actually Looks Like
The Domain Teams pattern describes structure. Here is what a dispatch actually looks like in practice: the instruction files, the prompt, and the file list. This example uses a terminal-based agent, but the pattern applies regardless of tool.
Before dispatching, the orchestrator prepares three things: the instruction files the agent will load, the file list it owns exclusively, and the task prompt.
Instruction files loaded into context:
.ai/instructions.md # Project-wide conventions (always loaded)
.ai/integrations/logging.md # Logging-specific patterns and examples
The dispatch prompt:
Migrate the following files from print-based output to the structured
logger established in Wave 0. Use the LoggerFactory pattern from
src/core/logger.py (committed and tested).
Files assigned to you (exclusive ownership this wave):
- src/commands/install.py
- src/commands/resolve.py
- src/commands/validate.py
Constraints:
- Do NOT modify any file not in this list.
- Do NOT change function signatures or public APIs.
- Preserve all existing behavior — update log output only.
- Use _rich_info() for informational messages, _rich_warning()
for warnings. See .ai/integrations/logging.md for examples.
When complete, run: pytest tests/commands/ -x
Fix any failures before reporting done.
What makes this dispatch effective:
- Exclusive file list. The agent knows exactly which files it owns. No ambiguity, no overlap with other agents in this wave.
- Committed reference. “Established in Wave 0” means the agent reads the actual committed code, not a description of what it should look like.
- Scoped instructions. Two instruction files, not twelve. The agent carries logging conventions and project conventions. It does not carry type system rules, security policies, or deployment procedures irrelevant to this task.
- Built-in validation. The prompt ends with a test command. The agent self-validates before reporting completion (L1 self-heal).
- Explicit constraints. “Do NOT modify any file not in this list” is the one-file-one-agent rule, stated in the agent’s own terms.
The orchestrator dispatches this prompt in a fresh session, monitors for completion or escalation, and moves to the next dispatch. Total orchestrator time per dispatch: roughly 2-3 minutes to prepare the prompt and file list, plus monitoring time shared across all active agents in the wave.
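The same dispatch can be assembled mechanically. The sketch below assumes a hypothetical `build_dispatch_prompt` helper in the orchestration layer; nothing here is part of any particular agent tool.

```python
def build_dispatch_prompt(task: str, files: list[str],
                          constraints: list[str], validate_cmd: str) -> str:
    """Assemble a dispatch prompt: task, exclusive file list,
    explicit constraints, and built-in validation (L1 self-heal)."""
    lines = [task, ""]
    lines.append("Files assigned to you (exclusive ownership this wave):")
    lines.extend(f"- {path}" for path in files)
    lines.append("")
    lines.append("Constraints:")
    lines.extend(f"- {constraint}" for constraint in constraints)
    lines.append("")
    lines.append(f"When complete, run: {validate_cmd}")
    lines.append("Fix any failures before reporting done.")
    return "\n".join(lines)
```

Templating the dispatch this way is what makes the third orchestration on a codebase so much cheaper than the first: only the task, file list, and validation command change.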
12.2.3 Pattern 3: Audit / Execute / Validate
For exploratory work where the scope is not fully known in advance, separate the agents that discover what needs to change from the agents that make the changes.
flowchart TD
A["Audit agents<br/>(read-only)"] -->|"Findings: files,<br/>severity, recommendations"| B["Planning<br/>(human decision)"]
B -->|"Scoped tasks with<br/>file assignments"| C["Execution agents<br/>(read-write)"]
C -->|"Code changes"| D["Validation agents<br/>(read-only)"]
D -->|"Review findings,<br/>test results"| E["Ship"]
The critical property of this pattern is the separation between read-only and read-write operations. Audit agents explore the codebase without modifying it. This means you can dispatch multiple audit agents simultaneously with no risk of interference. They can examine the same files, look at overlapping concerns, and produce independent assessments.
The human decision between audit and execution is the highest-leverage point in the entire process. You review the findings, decide which ones to act on, define the scope, and only then allow write operations. This separation is what makes multi-agent orchestration safe, not just fast.
12.3 Parallelization Strategies
Running agents in parallel reduces wall-clock time. It also introduces the possibility of conflict. The strategies below manage the trade-off.
12.3.1 The One-File-One-Agent Rule
The most important parallelization rule is the simplest: within a single execution batch, no two agents may modify the same file.
Most agent tooling edits files using string matching: the agent specifies an exact block of text to find and replace. If Agent A modifies install.py and then Agent B tries to edit the same file, the text Agent B expects to find has already changed. The edit fails silently or produces corrupted output.
This is not a theoretical risk. It is the most common failure mode in parallel agent execution.
| Pattern | Agent | Files | Risk |
|---|---|---|---|
| ✅ Safe | Agent A | resolver.py, dependency_graph.py | None — distinct files |
| ✅ Safe | Agent B | install.py | None — distinct files |
| ✅ Safe | Agent C | cli.py, commands/init.py | None — distinct files |
| ❌ Unsafe | Agent A | install.py (lines 100–200) | Conflict — Agent B’s line references invalidated |
| ❌ Unsafe | Agent B | install.py (lines 400–500) | Conflict — after Agent A’s edits |
Enforcing this rule requires partitioning the file set before dispatch. If two concerns both need to modify the same file, those modifications go to a single agent that handles both, or they go to separate sequential waves.
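The partition check itself is mechanical. A minimal sketch (a hypothetical helper, not part of any agent tool) that flags a wave plan whenever any file has two owners:

```python
def find_ownership_conflicts(
        assignments: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each multiply-owned file to the agents claiming it.

    `assignments` maps agent name -> files it owns this wave.
    An empty result means the wave satisfies one-file-one-agent.
    """
    owners: dict[str, list[str]] = {}
    for agent, files in assignments.items():
        for path in files:
            owners.setdefault(path, []).append(agent)
    return {path: agents for path, agents in owners.items()
            if len(agents) > 1}
```

Run this before dispatch; any non-empty result means merging tasks into one agent or splitting them across waves.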
12.3.2 Wave-Based Parallelism
The wave model structures execution as a sequence of batches, where each batch runs in parallel and the entire batch completes before the next one starts.
flowchart TD
W0["Wave 0: Foundation"]
W0A["Types"]
W0B["Utilities"]
W0C["Config"]
W0 --- W0A & W0B & W0C
CHECK0{"Tests pass"}
W0A & W0B & W0C --> CHECK0
W1["Wave 1: Core"]
CHECK0 --> W1
W1D["Migration"]
W1E["API"]
W1F["Auth"]
W1 --- W1D & W1E & W1F
CHECK1{"Tests pass"}
W1D & W1E & W1F --> CHECK1
W2["Wave 2: Integration"]
CHECK1 --> W2
W2G["Wiring"]
W2H["Tests"]
W2 --- W2G & W2H
The dependencies between waves are explicit. Wave 1 agents can rely on Wave 0’s output being committed and tested. Within a wave, agents are independent: they share no files and make no assumptions about each other’s progress.
Wave sizing matters. A wave with 2-3 agents completes in the time it takes the slowest agent to finish, typically 3-5 minutes. A wave with 8 agents takes 8-10 minutes, because the slowest agent dominates, and it also increases the risk of a failure that blocks the entire wave. Prefer more, smaller waves over fewer, larger ones.
In the PR #394 execution (case study), four waves plus one recovery wave handled 75 files in roughly 24 minutes of agent computation time. Wave sizes ranged from 1 to 5 agents. The parallelism saved approximately 21 minutes compared to sequential execution — based on an estimated 45 minutes of sequential agent time versus the 24 minutes of parallel agent computation time observed. This comparison is estimated, not measured — we did not run the sequential approach. The numbers reflect the author’s judgment based on prior single-agent attempts of similar scope. The real value was in reduced context degradation, not reduced time.
gantt
title Wave Execution Timeline
dateFormat X
axisFormat %s
section Wave 0 — Foundation
Types :a, 0, 5
Utilities :b, 0, 4
Config :c, 0, 3
Checkpoint :crit, t0, 5, 7
section Wave 1 — Core
Migration :d, 7, 12
API :e, 7, 11
Auth :f, 7, 13
Checkpoint :crit, t1, 13, 15
section Wave 2 — Integration
Wiring :g, 15, 19
Tests :h, 15, 18
Checkpoint :crit, t2, 19, 21
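The wave-and-checkpoint loop above can be sketched in a few lines. `dispatch`, `tests_pass`, and `commit` are stand-ins for tool-specific operations, not real APIs; the point is the shape: parallel within a wave, a gating checkpoint between waves.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_waves(waves: list[list[str]],
              dispatch: Callable[[str], None],
              tests_pass: Callable[[], bool],
              commit: Callable[[str], None]) -> int:
    """Run waves in order; return how many completed and committed."""
    completed = 0
    for i, tasks in enumerate(waves):
        # Parallel within the wave: agents share no files.
        with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
            list(pool.map(dispatch, tasks))
        # The checkpoint gates the next wave; committed output becomes
        # ground truth for the agents dispatched after it.
        if not tests_pass():
            return completed  # stop and diagnose before continuing
        commit(f"wave {i} complete")
        completed += 1
    return completed
```

Note that a failed checkpoint halts the whole sequence; that is the property that keeps later waves from planning against a broken foundation.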
12.3.3 Pipeline Parallelism
Some operations can run in parallel across workflow stages rather than across files. While execution agents work on Wave 1, review agents can start reviewing Wave 0’s output. While human review happens on one wave, test agents can run extended validation on a previous wave.
gantt
title Pipeline Parallelism
dateFormat X
axisFormat %s
section Wave 0
Agents :a0, 0, 8
Review :r0, 8, 12
Tests :t0, 8, 18
section Wave 1
Agents :a1, 10, 22
Review :r1, 22, 26
This works when the review and test operations are read-only and the execution operations have no backward dependencies on review findings. If a review agent finds a problem in Wave 0, the fix goes into a later wave; it does not interrupt Wave 1, which was planned against the committed Wave 0 output.
12.4 Conflict Resolution
Despite careful partitioning, conflicts arise. They fall into three categories, each with a different resolution strategy.
12.4.1 File Conflicts
Two agents need to modify the same file in the same wave. This is a planning error, not a runtime error.
Resolution. Merge the two tasks into a single agent’s scope, or move one task to a later wave. If the modifications are to genuinely independent sections of a large file, a single agent can handle both sets of changes in one pass — it carries the context for both and applies edits sequentially.
Files that attract changes from multiple concerns are a signal. If install.py needs auth changes, logging changes, and type safety changes, that file is a coordination bottleneck. In the plan, assign it to a single agent per wave, even if that agent handles multiple concerns for that file.
12.4.2 Semantic Conflicts
Two agents produce output that is independently correct but mutually inconsistent. Agent A introduces a new error-handling pattern. Agent B, working on a different file, follows the old pattern because its instructions referenced the pre-change codebase.
Resolution. Foundation-before-migration wave ordering. Changes that establish new patterns (type definitions, utility functions, shared conventions) go in early waves. Changes that consume those patterns go in later waves. Agent B’s instructions reference the committed output of Wave 0, not the original codebase.
This is why the wave model requires testing and committing after each wave. The committed state after Wave 0 is the ground truth for Wave 1 agents. If you skip the commit, Wave 1 agents work against stale context and semantic conflicts multiply.
12.4.2.1 Semantic Conflict Recovery: A Walkthrough
Prevention is the ideal. But when a semantic conflict slips through wave ordering, you need a recovery procedure, not a principle. The PR #394 execution hit exactly this case. Here is what happened.
Wave 0 established a new OperationError type in the resolver module — a structured error with fields for error code, operation name, and a recoverable flag. This replaced the previous pattern of raising bare ValueError with message strings. The architecture agent committed the new type, updated the resolver, and all Wave 0 tests passed.
Wave 1 dispatched a domain agent to migrate logging in three command modules. The agent’s instructions referenced the committed Wave 0 codebase, but the dispatch prompt focused on logging patterns, not error handling. The domain agent correctly migrated all logging calls — and, in the process, added new error paths that used the old ValueError pattern. It had no reason to do otherwise. Its context included the logging migration guide, not the error-handling changes from Wave 0.
Detection. Wave 1’s unit tests passed. The individual files were correct in isolation. But the integration test suite — run at the wave checkpoint — caught mixed error types. Callers updated in Wave 0 now expected OperationError. The new error paths added in Wave 1 raised ValueError. Three integration tests failed with unhandled exception types.
Diagnosis. The orchestrator reviewed the failures and identified the root cause in under two minutes: the Wave 1 dispatch prompt loaded the logging context but not the error-handling context. The agent had no visibility into the pattern change.
Recovery. The fix was not to revert Wave 1. The logging migration was correct. Instead, the orchestrator dispatched Wave 2b — two targeted agents:
- Agent 2b-A received the three command modules plus OperationError's type definition. Its single task: replace every ValueError raise in the migrated files with the equivalent OperationError. Six files in context. Completed in 3 minutes.
- Agent 2b-B updated the corresponding test files to assert on OperationError fields instead of exception message strings.
Wave 2b committed, all tests passed, and execution continued to Wave 3.
The lesson. Wave ordering prevents most semantic conflicts. When one slips through, the recovery follows a pattern: identify which context was missing from the dispatch, create a targeted recovery wave that carries the missing context, and fix forward rather than reverting. The recovery wave is small — scoped to exactly the files affected by the missing context — and fast, because the agents start with clean sessions and concentrated context.
The mistake to avoid: redispatching the entire original wave. The logging migration was 90% correct. A full redo wastes the work and risks introducing new issues. Surgical recovery waves are the right response to surgical failures.
12.4.3 Design Conflicts
Two agents, each following their specialization’s best practices, produce output that reflects genuinely different design philosophies. The architecture agent consolidates error handling into a central module. The domain agent keeps error handling local to each command because the domain’s UX conventions require command-specific error messages.
Resolution. This is an escalation to the human. Design conflicts are not bugs. They are trade-offs that require judgment. The plan’s priority-ordered principles resolve most of them mechanically: if UX is prioritized above architectural purity, the domain agent’s approach wins. When the principles don’t resolve the conflict, the human decides and documents the rationale.
The frequency of design conflicts is itself a metric. In the PR #394 execution (case study), 3 human interventions were needed across ~25 agent dispatches. None were design conflicts between agents; all were judgment calls that the plan could not automate. The intervention rate was approximately 12%. We use 15–20% as a starting hypothesis for well-planned work, though this has not been validated across multiple teams. These thresholds are calibration points from our reference case study, not validated benchmarks. Rates significantly above 20% may indicate underspecified plans. Rates below 5% warrant scrutiny — the work may be too simple for multi-agent orchestration, or review may be insufficient.
12.5 The Human as Orchestrator
In a multi-agent workflow, the human role shifts from producer to orchestrator. You do not write the code. You do not review every line. You make the decisions that agents cannot make for themselves: scope, priority, trade-offs, and when to stop.
This is not a passive role. It is a different kind of active.
12.5.1 What the Orchestrator Decides
Before execution. The orchestrator defines the plan: which concerns to address, which to defer, how to partition the work across agents and waves, and what principles govern trade-offs. This is the highest-leverage activity in the entire process. A well-structured plan with mediocre agents produces better results than a vague plan with excellent agents.
During execution. The orchestrator monitors progress and handles escalations. Most waves complete without intervention. When an agent gets stuck, the orchestrator diagnoses whether it is a prompt problem (refine the instructions and retry), a scope problem (split the task), or a tooling problem (work around the limitation). When agents produce conflicting output, the orchestrator resolves the conflict using the plan’s principles or makes an explicit design decision.
After execution. The orchestrator spot-checks critical changes, verifies test results, and decides whether the output meets the acceptance criteria. The level of detail in the review is proportional to the risk, not the volume. A 2,000-line diff where 1,800 lines are mechanical migration does not require reading all 2,000 lines. It requires verifying that the migration pattern is correct, that the 200 non-mechanical lines are sound, and that the test suite covers the behavior.
12.5.2 The Escalation Protocol
Not every problem requires human attention. A well-designed orchestration system handles most failures automatically. The four-level escalation protocol:
| Level | Trigger | Response | Example |
|---|---|---|---|
| L1: Self-heal | Agent hits a test failure it can debug | Agent fixes and continues | Type error in generated code |
| L2: Retry | Agent produces incomplete output | Re-dispatch with refined prompt | Agent missed 3 of 12 files in scope |
| L3: Human decides | Trade-off between competing principles | Human makes design call | UX convention vs. architectural purity |
| L4: Scope change | Finding requires work outside the current plan | Human creates follow-up task | Discovery of a pre-existing bug unrelated to the change |
The L1 and L2 levels are automated. L3 and L4 require human judgment. The goal is to minimize L3 and L4 interventions not by suppressing them, but by making the plan specific enough that most decisions resolve at L1 or L2.²
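A harness can route most outcomes without asking anyone. The routing table below is illustrative only: the outcome labels are hypothetical, and a real harness would derive them from test results and agent reports.

```python
# Hypothetical outcome labels mapped to the four-level protocol.
ESCALATION_ROUTES = {
    "test_failure_debuggable": "L1",  # agent fixes and continues
    "incomplete_output":       "L2",  # re-dispatch with refined prompt
    "principle_tradeoff":      "L3",  # human makes the design call
    "out_of_scope_finding":    "L4",  # human creates a follow-up task
}

def escalation_level(outcome: str) -> str:
    # Unrecognized problems default to a human decision.
    return ESCALATION_ROUTES.get(outcome, "L3")

def needs_human(outcome: str) -> bool:
    return escalation_level(outcome) in ("L3", "L4")
```

The default matters: anything the harness cannot classify goes to a human rather than to a retry loop.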
In the PR #394 execution (case study), the distribution across all decision points within the ~25 agent dispatches was roughly two-thirds autonomous (L1), one-eighth automated retry (L2), and one-fifth human decision (L3/L4). Measured against dispatches, that was three interventions during wave execution out of ~25, roughly 12%, which is characteristic of what we observed in a well-planned execution on a non-trivial change. If your L3+ rate exceeds 25%, the plan needs more specific principles or better task scoping.
12.6 The Coordination Tax: Honest Numbers
Multi-agent orchestration saves time through parallelism and improves quality through focused context. It also costs time through planning, monitoring, and intervention. Here is the honest accounting, based on the PR #394 execution (The APM Auth + Logging Overhaul).
In the PR #394 case, the human time in a well-planned multi-agent execution broke down into four activities: planning and partitioning (~30% of human time), monitoring execution (~20%), handling interventions (~25%), and post-execution review (~25%). Total human time was roughly 45 minutes against 24 minutes of agent computation time, with a total wave execution time of roughly 90 minutes because human work overlaps with agent execution.
The same change executed sequentially by a single agent would take an estimated 60-75 minutes of agent time — but with compounding context degradation after file 20. Based on similar single-agent attempts at this scale, expect 2-3 additional rework cycles to fix quality issues caused by context overload, adding 30-45 minutes. Total single-agent elapsed time: roughly 90-120 minutes with lower output quality.
The multi-agent approach did not save total elapsed time on this change. It traded human planning time for agent quality. The 45 minutes the orchestrator spent coordinating replaced the 30-45 minutes they would have spent debugging context-degraded output — with better results.
12.6.1 The Sweet Spot
Multi-agent orchestration pays for itself when:
- File count exceeds 20 across 2+ concerns. Below this threshold, planning overhead exceeds the parallelism benefit. One agent with good instructions handles 15 files in a single concern faster than two agents with coordination overhead.
- Concerns partition cleanly. If most files need changes from multiple concerns, you spend more time managing file ownership conflicts than you save through parallelism.
- Context degradation is the real bottleneck. For changes that require deep architectural understanding — not mechanical find-and-replace — the quality benefit of focused context outweighs the coordination cost.
- You will do this more than once. The first multi-agent orchestration on a codebase takes longest because you are building instruction files, learning partition boundaries, and developing intuition for wave sizing. The second takes half the planning time. By the third, the dispatch prompts are templates.
The overhead for a well-planned multi-agent execution is roughly 40-60% of total human time spent on coordination rather than direct value work. For a poorly planned execution (vague dispatch prompts, unclear file ownership, missing instruction files), we estimate, based on the contrast between our reference execution and less-structured attempts, that the overhead can reach 70-80% or higher, at which point a single agent with good context would have been faster.
This is not a tool for every change. It is a tool for changes that exceed what a single agent can hold in focus. Know the threshold for your codebase, and do not orchestrate for the sake of orchestrating.
12.7 Session Management
Every agent dispatch creates a session: a context window with its own conversation history, loaded instructions, and accumulated state. Managing these sessions is a practical concern that directly affects output quality.
12.7.1 Session Isolation
Each agent session is independent. Agent A cannot see Agent B’s conversation history, edits, or reasoning. This is a feature, not a limitation. Session isolation ensures that one agent’s context degradation does not propagate to others. If Agent A’s session becomes cluttered after a complex debugging sequence, Agent B starts fresh with a clean context.
The implication: information flows between agents through committed artifacts, not through shared sessions. When Agent B needs to build on Agent A’s work, it reads the committed files — the same files that passed tests and were validated at the wave checkpoint. It does not read Agent A’s internal reasoning or discarded alternatives.
flowchart TD
subgraph A1["Agent 1"]
C1["Conversation<br/>(isolated)"]
end
subgraph A2["Agent 2"]
C2["Conversation<br/>(isolated)"]
end
subgraph A3["Agent 3"]
C3["Conversation<br/>(isolated)"]
end
FS["Shared filesystem<br/>(committed files = ground truth between waves)"]
C1 --> FS
C2 --> FS
C3 --> FS
12.7.2 Session Lifetime
Shorter sessions produce better output. A session that has been running for 40 turns carries 40 turns of conversation history, which consumes context that could hold source code or instructions. The marginal value of each additional turn decreases as history accumulates.
Three guidelines for session lifetime:
- One task per session. An agent dispatched to “migrate logging in resolver.py and update tests in test_resolver.py” is one task. Reusing that session for a second, unrelated task inherits the first task’s conversation history, dead weight for the second task.
- Reset on failure. If an agent gets stuck — looping on the same error, producing the same incorrect output — terminate the session and dispatch a fresh one with refined instructions. The fresh session starts without the accumulated confusion of the failed attempt.
- State through files, not memory. Any information that needs to survive across sessions must be written to the filesystem: committed code, plan documents, checkpoint records. Session-internal state (the agent’s reasoning, intermediate attempts, debugging output) is ephemeral and should be treated as such.
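In its smallest form, “state through files, not memory” is just a record the orchestrator writes at each checkpoint so that any fresh session can reconstruct where execution stands. The record shape and filename below are illustrative, not a prescribed format.

```python
import json
from pathlib import Path

def write_checkpoint(path: Path, wave: int, status: str,
                     files: list[str]) -> None:
    """Persist coordination state to disk, not to any agent's session."""
    record = {"wave": wave, "status": status, "files": files}
    path.write_text(json.dumps(record, indent=2))

def read_checkpoint(path: Path) -> dict:
    """A fresh session reads this instead of inheriting history."""
    return json.loads(path.read_text())
```

Anything not written this way should be assumed lost the moment the session ends.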
12.7.3 Cross-Session Coordination
The orchestration layer — whether a human with multiple terminal windows or an automated harness — maintains the coordination state that individual agent sessions cannot.
This state includes:
- Task status. Which tasks are pending, in progress, complete, or blocked.
- File ownership. Which agent is currently modifying which files — the enforcement mechanism for the one-file-one-agent rule.
- Wave progress. Which waves have been completed and tested, which is currently executing.
- Escalation log. What has been escalated, what decision was made, and why.
The agent sessions are stateless workers. The coordination layer is the stateful manager. Keeping this separation clean is what makes multi-agent orchestration predictable. When the coordination state is mixed into agent sessions — when an agent is asked to “track which files you’ve changed and tell the next agent” — the result is fragile and error-prone.
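The coordination state listed above fits in a single structure. This dataclass is an illustrative sketch: the field names are assumptions, and the only load-bearing idea is that this state lives in the orchestration layer, never inside an agent session.

```python
from dataclasses import dataclass, field

@dataclass
class OrchestrationState:
    # task -> pending / in_progress / complete / blocked
    task_status: dict[str, str] = field(default_factory=dict)
    # file -> owning agent (enforces one-file-one-agent)
    file_owner: dict[str, str] = field(default_factory=dict)
    # waves that have been committed and tested
    waves_done: list[int] = field(default_factory=list)
    # what was escalated, what was decided, and why
    escalations: list[dict] = field(default_factory=list)

    def claim(self, agent: str, path: str) -> bool:
        """Grant exclusive ownership of a file for this wave, or refuse."""
        if path in self.file_owner:
            return False
        self.file_owner[path] = agent
        return True
```

The `claim` method is the enforcement point: a second agent asking for an already-owned file is refused before any dispatch happens.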
What we have been describing is the runtime layer of the agentic computing stack (Chapter 4). Just as an operating system manages processes, memory, and I/O for a CPU, the orchestration harness manages sessions, context loading, and file I/O for the LLM. The harness doesn’t do the thinking — it creates the conditions under which thinking produces reliable results.
12.8 Putting It Together
The patterns described in this chapter compose into a workflow:
flowchart TD
A1["1. ASSESS<br/>Scope"] --> A2["2. SPECIALIZE<br/>Team"]
A2 --> A3["3. PARTITION<br/>Files"]
A3 --> A4["4. ORDER<br/>Waves"]
A4 --> A5["5. DISPATCH<br/>Execute"]
A5 --> A6{"6. VALIDATE<br/>Test & check"}
A6 -->|"Pass"| A5
A6 -->|"Fail"| DX["DIAGNOSE"]
DX -->|"L1/L2"| RETRY["Retry"]
DX -->|"L3/L4"| HUMAN["Human decides"]
RETRY --> A5
HUMAN --> A4
This is not a rigid process. A small change might skip straight from step 1 to dispatching a single agent. A large change might iterate through steps 5-6 four or five times. The structure is a decision framework, not a ceremony.
What matters is the discipline behind it: agents are specialized so their context is concentrated, files are partitioned so agents don’t conflict, waves are ordered so dependencies are satisfied, and the human makes the decisions that require judgment rather than the decisions that require typing.
The next chapter puts this framework into practice. It walks through the full five-phase meta-process — Audit, Plan, Wave, Validate, Ship — with the reference case study execution that used every pattern described here.
1. The coordination overhead of multi-agent systems parallels Brooks’s observation that adding developers to a late project makes it later (The Mythical Man-Month, 1975). The difference: agent coordination overhead is predictable and can be reduced through better planning, not just through better communication.
2. Our autonomy taxonomy draws conceptually from the SAE levels of driving automation (SAE J3016), adapted for software engineering contexts. The key parallel: higher autonomy levels require not less human involvement, but different kinds of involvement — supervision rather than operation.