```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI (--verbose)
    participant VL as verbose_log
    participant RE as _rich_echo
    User->>CLI: apm install pkg --verbose
    CLI->>VL: verbose_log("Resolving...")
    VL->>RE: _rich_echo("Resolving...")
    RE-->>VL: NameError: not defined
    VL-->>CLI: Exception silently swallowed
    CLI->>User: (silent failure)
```
15 The APM Auth + Logging Overhaul
Scope: PR #394 against microsoft/apm — auth consolidation, logging abstraction, diagnostic collection
Duration: ~16 hours wall-clock across 2 sessions
Theme: Multi-agent orchestration against a production codebase — the methodology’s home turf
These four case studies test the methodology under progressively different conditions. The APM Overhaul applies multi-agent orchestration to a production codebase — engineering in its purest form. The Handbook Writing study tests whether the same patterns transfer to editorial composition. The Publishing Pipeline tests infrastructure automation. The Growth Engine tests non-engineering work where agent limitations become most visible. Each study documents what worked, what failed, and what the methodology could not do.
15.1 Orchestrating a 75-File Architecture Change Across 25 Agents
This case study documents a real Copilot CLI session against microsoft/apm (PR #394). Every metric comes from session checkpoint logs. Chapters 12–13 introduced this session’s structure. Here we focus on what went wrong, what the recovery looked like, and what it demonstrates about the methodology under production conditions.
- Budget context before you dispatch. Count call sites; split at ~25 per agent. 58 was too many.
- Verify filesystem, not self-reports. `diff` target files after every dispatch. Agent success messages are probabilistic output.
- Expert panels are audits, not oracles. Panel findings must be validated before they become wiring instructions.
- Expand scope through the plan gate. Mid-planning expansion is healthy. Mid-wave expansion is dangerous.
- Checkpoints must assert behaviour. 2,829 passing tests did not catch a silent `NameError`. Assert on observable output.
15.2 The Problem
A user reported confusing UX when apm install failed for a GitHub Enterprise Managed Users (EMU) organization package. Investigation revealed a single root cause branching into three systemic failures:
- Auth bypass. `_validate_package_exists()` ran bare `git ls-remote` without credentials — `GITHUB_APM_PAT` was ignored for github.com hosts.
- Auth fragmentation. Four inconsistent auth implementations were scattered across the install, download, copilot, and operations modules.
- Observability gap. 766 ad-hoc `_rich_*` logging calls across 27 files with no shared abstraction. When verbose logging silently failed (a `NameError` caught by an outer `try/except`), there was no way to know.
A point fix would have patched `_validate_package_exists()`. The structural fix required centralised auth resolution, a command-logger abstraction, and diagnostic collection — touching 75 files.
15.3 Expert Panel and Plan Evolution
The session operated at three scales: a 6-expert audit panel for diagnosis, a fleet of ~25 agents for implementation, and individual agents for targeted fixes.
Six experts ran in parallel during the audit phase: GitHub auth patterns, EMU constraints, Azure DevOps auth, architecture design, CLI UX, and documentation. Each produced severity-ranked findings. The EMU Expert identified that host-gating ghu_ tokens was incorrect (later validated as Escalation #3). The Architecture Expert proposed Strategy + Chain of Responsibility as the auth consolidation pattern. The CLI UX Expert found 766 ad-hoc logging calls with no shared abstraction. All findings were synthesised into a single source-of-truth document.
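The Architecture Expert's Strategy + Chain of Responsibility proposal can be sketched in a few lines of Python. The provider classes, method names, and precedence order below are illustrative assumptions, not apm's actual implementation:

```python
import os
from abc import ABC, abstractmethod
from typing import Optional

class AuthProvider(ABC):
    """One strategy in the chain: resolve a credential or defer to a successor."""

    def __init__(self, successor: Optional["AuthProvider"] = None):
        self.successor = successor

    def resolve(self, host: str) -> Optional[str]:
        token = self.try_resolve(host)
        if token is not None:
            return token
        return self.successor.resolve(host) if self.successor else None

    @abstractmethod
    def try_resolve(self, host: str) -> Optional[str]:
        ...

class EnvVarProvider(AuthProvider):
    """Strategy: read a token from one environment variable."""

    def __init__(self, var: str, successor: Optional[AuthProvider] = None):
        super().__init__(successor)
        self.var = var

    def try_resolve(self, host: str) -> Optional[str]:
        return os.environ.get(self.var)

class AnonymousProvider(AuthProvider):
    """Chain terminator: an explicit 'no credentials' result, never a crash."""

    def try_resolve(self, host: str) -> Optional[str]:
        return ""

# Precedence: package-manager PAT first, then a generic token, then anonymous.
chain = EnvVarProvider("GITHUB_APM_PAT",
        EnvVarProvider("GITHUB_TOKEN",
        AnonymousProvider()))
```

The point of the pattern is that every call site asks one chain, so a fix to precedence or host handling lands in exactly one place instead of four.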
15.3.1 Eight Plan Iterations
The plan went through eight iterations before approval. This is not a failure — it is the meta-process working as designed (Ch13 §Plan).
| Version | Scope | Trigger |
|---|---|---|
| v1 | 10 UX message fixes | Initial triage |
| v2 | Removed v0.8.2-only bug | User correction |
| v3 | Added auth gap as root cause | Expert panel findings |
| v4 | Unauth-first → auth-fallback | Architecture expert |
| v5 | Architecture-first, 4 phases | Orchestrator restructure |
| v6 | Added agent/skill primitives | Instrumented codebase needs |
| v7 | ALL commands covered | User escalation (L4) |
| v8 ✓ | 25 todos, 47 files, 5 phases | Approved |
Version 7 was the critical turn. The user rejected a plan scoped to install only and demanded every command route through AuthResolver and CommandLogger. The orchestrator dispatched three more explore agents to audit all logging calls (766+), all auth touchpoints (95+), and per-dependency auth paths. This is Scope Creep (Anti-pattern #5) — but handled correctly. The scope expanded before execution began, through the plan gate, not mid-wave.
The plan targeted 47 primary source files. The final PR touched 75 — the additional 28 were test files, configuration updates, and documentation changes discovered during execution. This is typical of dependency-following refactors.
15.4 Wave Execution
The plan decomposed into five phases. Each wave followed checkpoint discipline — full test suite after every wave, commit before the next. Tests climbed from 2,829 to 2,897 across five waves.
| Wave | Agents | Scope | Checkpoint |
|---|---|---|---|
| Foundation | 3 parallel | AuthResolver, CommandLogger, DiagnosticCollector | 2,839 tests |
| Auth wiring | 8 parallel | One file per agent (downloader, install, copilot, operations, errors) | 2,846 tests |
| Logger wiring | 7 parallel | All commands through CommandLogger. install.py (58 calls) got stuck → escalated | 2,874 tests |
| Tests | parallel | 78 unit + 26 integration + 11 diagnostics | 2,897 tests |
| Ship | sequential | Docs, skills, CHANGELOG, PR review fixes | Released as v0.8.4 |
115 new tests were written (78 unit, 26 integration, 11 diagnostics); 47 legacy tests were consolidated or replaced during the refactor, for a net gain of 68.
The one-file-one-agent-per-wave rule (Ch12) prevented merge conflicts across all parallel dispatches. The one exception — install.py in the logger wave — is Escalation #1 below.
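The shape of the CommandLogger abstraction built in the Foundation wave can be sketched minimally. This is a hedged illustration of the idea — a single verbose gate that fails loudly — not apm's actual class:

```python
import sys

class CommandLogger:
    """Minimal sketch of a shared logging abstraction.

    The real PR routed 766 ad-hoc _rich_* calls through one object;
    this version models only the verbose gate and the fail-loud
    behaviour the old scattered calls lacked.
    """

    def __init__(self, verbose: bool = False, stream=None):
        self.verbose_enabled = verbose
        self.stream = stream or sys.stderr

    def info(self, message: str) -> None:
        print(message, file=self.stream)

    def verbose(self, message: str) -> None:
        # The old verbose_log lambda swallowed a NameError inside
        # try/except; here any rendering failure propagates.
        if self.verbose_enabled:
            print(f"[verbose] {message}", file=self.stream)

log = CommandLogger(verbose=True)
log.verbose("Resolving package index...")
```

Because the gate lives in one class, a behavioural test can inject a stream and assert that verbose mode actually produces output — exactly the assertion the 2,829-test suite was missing.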
15.5 Escalation Events
Five escalations occurred. Each maps to a specific anti-pattern.
Escalation severity follows a three-tier model:
| Level | Meaning | Action |
|---|---|---|
| L2 | Agent needs guidance | Orchestrator adjusts prompt and re-dispatches |
| L3 | Agent cannot complete | Orchestrator takes over the task manually |
| L4 | Plan scope changes | New todos added, potentially new wave |
15.5.1 1. The install.py Agent (Anti-pattern #11: Context Window Exhaustion)
The agent migrating install.py (58 _rich_* calls, the largest single file) ran for 45+ minutes and stopped producing coherent edits. The context window filled with its own prior output — a textbook case of Context Window Exhaustion.
Recovery. The orchestrator escalated to L3: wrote a Python script to strip 30 dead else-branch fallbacks, manually fixed 3 duplicate calls, and committed. The context budgeting lesson: 58 call sites in one dispatch is too many. The file should have been split across two waves — structural calls in Wave 3a, verbose calls in Wave 3b.
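The ~25-site budget from the lessons list is easy to mechanise. A sketch (the helper is hypothetical; the 58-site count is this session's install.py):

```python
def partition_call_sites(sites, budget=25):
    """Split a file's call sites into dispatches of at most `budget`.

    install.py had 58 _rich_* call sites; sending all of them to one
    agent exhausted its context window. At a budget of 25, the same
    work needs at least three dispatches.
    """
    return [sites[i:i + budget] for i in range(0, len(sites), budget)]

waves = partition_call_sites(list(range(58)))
# 58 sites at a budget of 25 -> dispatches of 25, 25, and 8
```

The budget number is empirical, not principled — the point is that the partitioning happens before dispatch, when call sites are counted, not after an agent stalls.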
15.5.2 2. Unicode Agent Persistence (Anti-pattern #12: Hallucinated Edits)
The Wave 3 unicode cleanup agent reported all replacements complete. File inspection showed zero changes — the agent had written to a temporary copy. This is Hallucinated Edits: the agent’s self-report diverged from filesystem state.
Recovery. Orchestrator performed all replacements manually: `✓`→`[+]`, `✗`→`[x]`, `⚠`→`[!]`, `→`→`->`, `—`→`--` across 4 files. The Trust Fall (Anti-pattern #7) was also in play — the orchestrator initially accepted the agent’s success report without file verification.
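The replacement map is small enough that the manual fix could equally have been a short script. A sketch under that assumption (the helper names are hypothetical):

```python
from pathlib import Path

# The five substitutions the orchestrator applied by hand.
REPLACEMENTS = {
    "✓": "[+]",
    "✗": "[x]",
    "⚠": "[!]",
    "→": "->",
    "—": "--",
}

def asciify(text: str) -> str:
    """Apply every substitution in REPLACEMENTS to a string."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

def asciify_file(path: Path) -> bool:
    """Rewrite one file in place; return True only if it actually changed."""
    original = path.read_text(encoding="utf-8")
    updated = asciify(original)
    if updated != original:
        path.write_text(updated, encoding="utf-8")
    return updated != original
```

Returning whether the file changed is the script-level analogue of the diff check: a deterministic signal of persistence, not a self-report.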
15.5.3 3. Token Type Correction (Anti-pattern #13: Stale Context Between Waves)
The expert panel classified ghu_ tokens as EMU-specific. The user corrected: ghu_ is OAuth; EMU users receive standard ghp_/github_pat_ tokens. The security constraint host-gating global env vars was built on stale expert output — Stale Context Between Waves.
Recovery. Orchestrator updated AuthResolver to remove the incorrect host-gating logic and re-ran auth tests.
15.5.4 4. Fine-Grained PAT 403 Failure (L4 — Plan scope change)
Auth still failed with a valid fine-grained PAT. Root cause: x-access-token:{token}@host URL format sends Basic auth, which GitHub rejects for fine-grained PATs. The plan assumed git ls-remote would work for all token types.
Recovery. Pivoted validation entirely from git ls-remote to the GitHub REST API — a single code path that works for all token types. This was an L4 escalation — scope expanded beyond the original plan. The orchestrator chose one validation strategy over a branching tree of token-type logic, avoiding complexity at the architecture level.
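The pivoted validation path can be sketched with the standard library. The endpoint (`GET /repos/{owner}/{repo}`) and the `Authorization: Bearer` header are GitHub's documented REST interface; the function names and structure below are illustrative, not apm's code:

```python
import urllib.error
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"

def repo_probe(owner: str, repo: str, token: Optional[str]) -> urllib.request.Request:
    """Build the REST probe that replaced `git ls-remote`."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Bearer auth is accepted for classic and fine-grained PATs alike,
        # unlike the x-access-token Basic-auth URL format.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}/repos/{owner}/{repo}", headers=headers)

def repo_exists(owner: str, repo: str, token: Optional[str] = None) -> bool:
    """One code path for all token types: HTTP 200 means the repo is visible."""
    try:
        with urllib.request.urlopen(repo_probe(owner, repo, token)) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

A single probe function is the architectural choice the orchestrator made: one code path, rather than branching on token prefix.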
15.5.5 5. Verbose Logging Silent NameError
`_rich_echo` was never imported in install.py. The `verbose_log` lambda triggered a `NameError` caught by an outer `try/except`. Verbose mode silently did nothing. No test caught it because the test suite never asserted on verbose output.
This is why checkpoint discipline matters (Ch12) — and why it must include behavioural assertions, not just “tests pass.” The test suite had 2,829 passing tests and none verified that verbose mode actually produced output.
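The failure mode is easy to reproduce in isolation. In this sketch the names mirror the narrative (`verbose_log`, `_rich_echo`) but the bodies are illustrative: the outer `try/except` turns a missing import into silence, and only an assertion on captured output exposes it:

```python
import io
import contextlib

def make_verbose_log(verbose: bool):
    # _rich_echo is deliberately *not* defined here, mirroring the
    # missing import in install.py.
    def verbose_log(msg):
        if verbose:
            _rich_echo(msg)  # NameError raised at call time, not import time
    return verbose_log

def install(verbose: bool = False) -> None:
    verbose_log = make_verbose_log(verbose)
    try:
        verbose_log("Resolving package...")
    except Exception:
        pass  # the outer handler that swallowed the NameError

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    install(verbose=True)

# install() returns normally -- "tests pass" -- yet verbose mode
# printed nothing. Only asserting on output reveals the bug.
assert buf.getvalue() == ""
```

A behavioural checkpoint would assert the opposite: that `install(verbose=True)` produces non-empty output.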
After any agent reports success, verify filesystem state directly — never trust self-reports:
```shell
# After agent claims "all unicode replacements complete":
git diff --stat          # Did any files actually change?
grep -rn "✓\|✗\|⚠" src/  # Are the old characters still there?
```

Agent success messages are probabilistic output. The diff command is deterministic. Build this check into every checkpoint.
15.6 Anti-Pattern Mapping
| Escalation | Anti-pattern | PROSE Constraint | Resolution |
|---|---|---|---|
| install.py stuck | #11 Context Window Exhaustion | Progressive Disclosure | Split file across waves; manual completion |
| Unicode persistence | #12 Hallucinated Edits | Safety Boundaries | File-state verification after every dispatch |
| Unicode persistence | #7 The Trust Fall | Safety Boundaries | Never accept self-report without diff check |
| Token type error | #13 Stale Context Between Waves | Progressive Disclosure | Re-validate expert findings before wiring |
| PAT 403 | #5 Scope Creep | Reduced Scope | L4 escalation through plan gate |
| Silent NameError | #9 Skipping Checkpoints | Safety Boundaries | Assert on observable behaviour, not just pass/fail |
15.7 What Held True Regardless of the Model
Ch15 defines five properties that hold regardless of model capability. This session tested three under production conditions.
“Context will remain finite and fragile.” The install.py agent proved it. 58 call sites exhausted the context window. No amount of model improvement eliminates the need for context budgeting — partitioning work to fit the window with room for reasoning.
“Output will remain probabilistic.” The unicode agent reported success on changes it never persisted. The same prompt, re-dispatched, might have worked. Reliability was architected through checkpoint discipline and file-state verification — not by trusting any single execution.
“Human judgment will remain the bottleneck and the differentiator.” The user’s v7 escalation — demanding all commands be covered, not just install — was the highest-leverage decision in the session. No agent suggested it. The 8 plan iterations were not wasted work; they were the mechanism through which human judgment shaped the architecture.
The evidence chain is inspectable at PR #394. The five escalations above are the full log.
| Metric | Value | Notes |
|---|---|---|
| Files changed | 75 | 47 planned + 28 dependency/test/config |
| Lines changed | +7,832/−1,074 | |
| Tests before | 2,829 | Baseline at start |
| Tests after | 2,897 | +115 written, −47 consolidated = +68 net |
| Agent dispatches | ~25 | Across 5 waves in 5 phases |
| Audit agents | 6 | Expert panel phase |
| Plan iterations | 8 | v1–v8, approved at v8 |
| Escalations | 5 | 1×L2, 2×L3, 2×L4 |
| Execution interventions | 3 | During wave execution (L3/L4 escalations) |
| Verification interventions | 2 | During review/validation |
| Wave execution time | ~90 minutes | Active agent + human review time |
| Total wall-clock time | ~16 hours | 2 sessions including planning, monitoring, breaks |
| Human time breakdown | ~30% planning, ~20% monitoring, ~25% interventions, ~25% review | Approximate, single execution |
Other chapters reference these metrics. “~90 minutes” refers to wave execution time; “~16 hours” refers to total elapsed time including planning and breaks.
In the author’s experience, similar-scope manual refactors (consolidating authentication patterns across 40+ files) have taken 3–5 days of focused engineering time. This estimate has not been formally benchmarked.