15  The APM Auth + Logging Overhaul

Scope: PR #394 against microsoft/apm — auth consolidation, logging abstraction, diagnostic collection
Duration: ~16 hours wall-clock across 2 sessions
Theme: Multi-agent orchestration against a production codebase — the methodology’s home turf

Note

Part IV: Case Studies

These four case studies test the methodology under progressively different conditions. The APM Overhaul applies multi-agent orchestration to a production codebase — engineering in its purest form. The Handbook Writing study tests whether the same patterns transfer to editorial composition. The Publishing Pipeline tests infrastructure automation. The Growth Engine tests non-engineering work where agent limitations become most visible. Each study documents what worked, what failed, and what the methodology could not do.


15.1 Orchestrating a 75-File Architecture Change Across 25 Agents

Note

This case study documents a real Copilot CLI session against microsoft/apm (PR #394). Every metric comes from session checkpoint logs. Chapters 12–13 introduced this session’s structure. Here we focus on what went wrong, what the recovery looked like, and what it demonstrates about the methodology under production conditions.

Tip

Five Lessons — The TL;DR
  1. Budget context before you dispatch. Count call sites; split at ~25 per agent. 58 was too many.
  2. Verify filesystem, not self-reports. diff target files after every dispatch. Agent success messages are probabilistic output.
  3. Expert panels are audits, not oracles. Panel findings must be validated before they become wiring instructions.
  4. Expand scope through the plan gate. Mid-planning expansion is healthy. Mid-wave expansion is dangerous.
  5. Checkpoints must assert behaviour. 2,829 passing tests did not catch a silent NameError. Assert on observable output.
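The context-budget rule in Lesson 1 can be sketched as a simple partition. The ~25-call threshold comes from this chapter; the helper name is illustrative, not part of apm or the orchestration tooling:

```python
# Hypothetical sketch: budget agent dispatches by call-site count before
# dispatching, so no single agent gets a file it cannot finish.
def partition_call_sites(call_sites, max_per_agent=25):
    """Split a list of call sites into batches small enough for one agent."""
    return [call_sites[i:i + max_per_agent]
            for i in range(0, len(call_sites), max_per_agent)]

# install.py's 58 calls would have needed three dispatches, not one:
batches = partition_call_sites(list(range(58)))
print([len(b) for b in batches])  # [25, 25, 8]
```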

15.2 The Problem

A user reported confusing UX when apm install failed for a GitHub Enterprise Managed Users (EMU) organization package. Investigation revealed a single root cause branching into three systemic failures:

  1. Auth bypass. _validate_package_exists() ran bare git ls-remote without credentials — GITHUB_APM_PAT was ignored for github.com hosts.
  2. Auth fragmentation. Four inconsistent auth implementations were scattered across install, download, copilot, and operations modules.
  3. Observability gap. 766 ad-hoc _rich_* logging calls across 27 files with no shared abstraction. When verbose logging silently failed (a NameError caught by an outer try/except), there was no way to know.

A point fix would have patched _validate_package_exists(). The structural fix required centralised auth resolution, a command-logger abstraction, and diagnostic collection — touching 75 files.
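The shape of the centralised fix can be sketched as a single resolution function every command calls. This is an illustration of the idea, not apm's actual AuthResolver API; the `AZURE_DEVOPS_PAT` variable is an assumption for the Azure DevOps path mentioned in the expert panel:

```python
import os

# Illustrative sketch of centralised auth resolution: one place that
# honours GITHUB_APM_PAT instead of each module running bare git
# commands. Names and env vars other than GITHUB_APM_PAT are assumed.
def resolve_token(host: str):
    """Return the credential every command should use for this host."""
    if host == "github.com":
        return os.environ.get("GITHUB_APM_PAT")
    if host.endswith("dev.azure.com"):
        return os.environ.get("AZURE_DEVOPS_PAT")  # assumption, for illustration
    return None
```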


15.3 Expert Panel and Plan Evolution

The session operated at three scales: a 6-expert audit panel for diagnosis, a fleet of ~25 agents for implementation, and individual agents for targeted fixes.

Six experts ran in parallel during the audit phase: GitHub auth patterns, EMU constraints, Azure DevOps auth, architecture design, CLI UX, and documentation. Each produced severity-ranked findings. The EMU Expert identified that host-gating ghu_ tokens was incorrect (later validated as Escalation #3). The Architecture Expert proposed Strategy + Chain of Responsibility as the auth consolidation pattern. The CLI UX Expert found 766 ad-hoc logging calls with no shared abstraction. All findings were synthesised into a single source-of-truth document.

15.3.1 Eight Plan Iterations

The plan went through eight iterations before approval. This is not a failure — it is the meta-process working as designed (Ch13 §Plan).

| Version | Scope | Trigger |
| --- | --- | --- |
| v1 | 10 UX message fixes | Initial triage |
| v2 | Removed v0.8.2-only bug | User correction |
| v3 | Added auth gap as root cause | Expert panel findings |
| v4 | Unauth-first → auth-fallback | Architecture expert |
| v5 | Architecture-first, 4 phases | Orchestrator restructure |
| v6 | Added agent/skill primitives | Instrumented codebase needs |
| v7 | ALL commands covered | User escalation (L4) |
| v8 ✓ | 25 todos, 47 files, 5 phases | Approved |

Version 7 was the critical turn. The user rejected a plan scoped to install only and demanded every command route through AuthResolver and CommandLogger. The orchestrator dispatched three more explore agents to audit all logging calls (766+), all auth touchpoints (95+), and per-dependency auth paths. This is Scope Creep (Anti-pattern #5) — but handled correctly. The scope expanded before execution began, through the plan gate, not mid-wave.

The plan targeted 47 primary source files. The final PR touched 75 — the additional 28 were test files, configuration updates, and documentation changes discovered during execution. This is typical of dependency-following refactors.


15.4 Wave Execution

The plan decomposed into five phases. Each wave followed checkpoint discipline — full test suite after every wave, commit before the next. Tests climbed from 2,829 to 2,897 across five waves.

| Wave | Agents | Scope | Checkpoint |
| --- | --- | --- | --- |
| Foundation | 3 parallel | AuthResolver, CommandLogger, DiagnosticCollector | 2,839 tests |
| Auth wiring | 8 parallel | One file per agent (downloader, install, copilot, operations, errors) | 2,846 tests |
| Logger wiring | 7 parallel | All commands through CommandLogger. install.py (58 calls) got stuck → escalated | 2,874 tests |
| Tests | parallel | 78 unit + 26 integration + 11 diagnostics | 2,897 tests |
| Ship | sequential | Docs, skills, CHANGELOG, PR review fixes | Released as v0.8.4 |

115 new tests were written (78 unit, 26 integration, 11 diagnostics); 47 legacy tests were consolidated or replaced during the refactor, for a net gain of 68.

The one-file-one-agent-per-wave rule (Ch12) prevented merge conflicts across all parallel dispatches. The one exception — install.py in the logger wave — is Escalation #1 below.
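The one-file-one-agent invariant is cheap to check mechanically before a wave is dispatched. A minimal sketch, with illustrative names (the orchestrator's real bookkeeping is not shown in the session logs):

```python
# Sketch of the one-file-one-agent-per-wave check: before dispatching a
# wave, find any file assigned to more than one agent in that wave.
def check_wave_disjoint(assignments):
    """assignments: {agent_name: [file, ...]} -> list of conflicting files."""
    seen, conflicts = {}, []
    for agent, files in assignments.items():
        for f in files:
            if f in seen:
                conflicts.append(f)
            seen[f] = agent
    return conflicts

# A wave must only dispatch when this returns an empty list.
```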


15.5 Escalation Events

Five escalations occurred. Each maps to a specific anti-pattern.

Escalation severity follows a three-tier model:

| Level | Meaning | Action |
| --- | --- | --- |
| L2 | Agent needs guidance | Orchestrator adjusts prompt and re-dispatches |
| L3 | Agent cannot complete | Orchestrator takes over the task manually |
| L4 | Plan scope changes | New todos added, potentially new wave |

15.5.1 1. The install.py Agent (Anti-pattern #11: Context Window Exhaustion)

The agent migrating install.py (58 _rich_* calls, the largest single file) ran for 45+ minutes and stopped producing coherent edits. The context window filled with its own prior output — a textbook case of Context Window Exhaustion.

Recovery. The orchestrator escalated to L3: wrote a Python script to strip 30 dead else-branch fallbacks, manually fixed 3 duplicate calls, and committed. The context budgeting lesson: 58 call sites in one dispatch is too many. The file should have been split across two waves — structural calls in Wave 3a, verbose calls in Wave 3b.
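The recovery script's approach — a line-oriented pass that drops a dead `else:` fallback branch — can be sketched as follows. The actual pattern in install.py is not reproduced in the session logs; the `print(` fallback shown here is an assumption for illustration:

```python
# Illustrative sketch of the L3 recovery script: strip two-line dead
# else-branch fallbacks of the form `else:` followed by a print call.
def strip_else_fallbacks(source: str) -> str:
    lines = source.splitlines(keepends=True)
    out, i = [], 0
    while i < len(lines):
        if (lines[i].strip() == "else:"
                and i + 1 < len(lines)
                and "print(" in lines[i + 1]):
            i += 2  # drop the dead fallback branch, keep the if-body
            continue
        out.append(lines[i])
        i += 1
    return "".join(out)
```

Removing only the `else:` and its single-statement body leaves the surrounding `if` block syntactically valid, which is why a line-oriented pass suffices here.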

15.5.2 2. Unicode Agent Persistence (Anti-pattern #12: Hallucinated Edits)

The Wave 3 unicode cleanup agent reported all replacements complete. File inspection showed zero changes — the agent had written to a temporary copy. This is Hallucinated Edits: the agent’s self-report diverged from filesystem state.

Recovery. Orchestrator performed all replacements manually: [+], [x], [!], ->, -- across 4 files. The Trust Fall (Anti-pattern #7) was also in play — the orchestrator initially accepted the agent’s success report without file verification.
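The manual replacement plus the verification step can be sketched together. The character-to-ASCII mapping is assumed from the replacement list above; the function name is illustrative:

```python
from pathlib import Path

# Sketch of the Escalation #2 recovery: perform the ASCII replacements
# directly, then verify filesystem state instead of trusting a
# self-report. Mapping assumed from the chapter's replacement list.
REPLACEMENTS = {"✓": "[+]", "✗": "[x]", "⚠": "[!]", "→": "->", "—": "--"}

def replace_unicode(path: Path) -> bool:
    text = original = path.read_text(encoding="utf-8")
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    if text != original:
        path.write_text(text, encoding="utf-8")
    # Deterministic check: no target characters remain on disk.
    return not any(ch in path.read_text(encoding="utf-8") for ch in REPLACEMENTS)
```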

15.5.3 3. Token Type Correction (Anti-pattern #13: Stale Context Between Waves)

The expert panel classified ghu_ tokens as EMU-specific. The user corrected: ghu_ is OAuth; EMU users receive standard ghp_/github_pat_ tokens. The security constraint (host-gating global env vars) was built on stale expert output — Stale Context Between Waves.

Recovery. Orchestrator updated AuthResolver to remove the incorrect host-gating logic and re-ran auth tests.

15.5.4 4. Fine-Grained PAT 403 Failure (L4 — Plan scope change)

Auth still failed with a valid fine-grained PAT. Root cause: x-access-token:{token}@host URL format sends Basic auth, which GitHub rejects for fine-grained PATs. The plan assumed git ls-remote would work for all token types.

Recovery. Pivoted validation entirely from git ls-remote to the GitHub REST API — a single code path that works for all token types. This was an L4 escalation — scope expanded beyond the original plan. The orchestrator chose one validation strategy over a branching tree of token-type logic, avoiding complexity at the architecture level.
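The pivot can be sketched with the stdlib. The endpoint and `Authorization: Bearer` header follow GitHub's documented REST API, which accepts both classic and fine-grained PATs; the function names are hypothetical, not apm's code:

```python
import urllib.request
import urllib.error

# Illustrative sketch of REST-based package validation (Escalation #4).
def build_repo_request(org: str, repo: str, token=None) -> urllib.request.Request:
    req = urllib.request.Request(f"https://api.github.com/repos/{org}/{repo}")
    if token:
        # Bearer auth works for classic AND fine-grained PATs, unlike the
        # x-access-token Basic-auth URL form that triggered the 403.
        req.add_header("Authorization", f"Bearer {token}")
    return req

def package_exists(org: str, repo: str, token=None) -> bool:
    try:
        with urllib.request.urlopen(build_repo_request(org, repo, token)) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

One code path for every token type replaces a branching tree of token-type logic — the complexity reduction the orchestrator chose.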

15.5.5 5. Verbose Logging Silent NameError

_rich_echo was never imported in install.py. The verbose_log lambda triggered a NameError caught by an outer try/except. Verbose mode silently did nothing. No test caught it because the test suite never asserted on verbose output.
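A minimal reproduction of the failure mode, simplified from the chapter's description of install.py (the function body here is an illustrative stand-in, not apm's actual code):

```python
# A NameError raised inside a verbose-logging lambda is swallowed by an
# outer try/except, so --verbose silently produces nothing.
def install(verbose: bool = False):
    # _rich_echo was never imported, so calling it raises NameError.
    verbose_log = (lambda msg: _rich_echo(msg)) if verbose else (lambda msg: None)
    try:
        verbose_log("Resolving package...")
        return "installed"
    except Exception:
        # Broad handler intended for network errors also eats the NameError.
        return "installed"

print(install(verbose=True))  # prints "installed" — no verbose output, no error
```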

```mermaid
sequenceDiagram
    participant User
    participant CLI as CLI (--verbose)
    participant VL as verbose_log
    participant RE as _rich_echo

    User->>CLI: apm install pkg --verbose
    CLI->>VL: verbose_log("Resolving...")
    VL->>RE: _rich_echo("Resolving...")
    RE-->>VL: NameError: not defined
    VL-->>CLI: Exception silently swallowed
    CLI->>User: (silent failure)
```
Figure 15.1: Sequence showing how a NameError was silently swallowed

This is why checkpoint discipline matters (Ch12) — and why it must include behavioural assertions, not just “tests pass.” The test suite had 2,829 passing tests and none verified that verbose mode actually produced output.
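A behavioural assertion for this case can be sketched pytest-style. The factory and test names are illustrative stand-ins for apm's CLI, not its actual API:

```python
# Sketch of the checkpoint fix: assert on observable verbose output,
# not just exit status or a green test count.
def verbose_log_factory(verbose, echo=print):
    return (lambda msg: echo(f"[verbose] {msg}")) if verbose else (lambda msg: None)

def test_verbose_mode_produces_output(capsys):
    log = verbose_log_factory(verbose=True)
    log("Resolving package...")
    out = capsys.readouterr().out
    # Behavioural assertion: verbose mode must actually emit text.
    assert "Resolving package..." in out
```

Had any test asserted on verbose output like this, the missing `_rich_echo` import would have failed the suite instead of passing silently.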


Tip

Try This: Filesystem Verification After Agent Dispatch

After any agent reports success, verify filesystem state directly — never trust self-reports:

```sh
# After agent claims "all unicode replacements complete":
git diff --stat           # Did any files actually change?
grep -rn "✓\|✗\|⚠" src/  # Are the old characters still there?
```

Agent success messages are probabilistic output. The diff command is deterministic. Build this check into every checkpoint.


15.6 Anti-Pattern Mapping

| Escalation | Anti-pattern | PROSE Constraint | Resolution |
| --- | --- | --- | --- |
| install.py stuck | #11 Context Window Exhaustion | Progressive Disclosure | Split file across waves; manual completion |
| Unicode persistence | #12 Hallucinated Edits | Safety Boundaries | File-state verification after every dispatch |
| Unicode persistence | #7 The Trust Fall | Safety Boundaries | Never accept self-report without diff check |
| Token type error | #13 Stale Context Between Waves | Progressive Disclosure | Re-validate expert findings before wiring |
| PAT 403 | #5 Scope Creep | Reduced Scope | L4 escalation through plan gate |
| Silent NameError | #9 Skipping Checkpoints | Safety Boundaries | Assert on observable behaviour, not just pass/fail |

15.7 What Held True Regardless of the Model

Ch15 defines five properties that hold regardless of model capability. This session tested three under production conditions.

“Context will remain finite and fragile.” The install.py agent proved it. 58 call sites exhausted the context window. No amount of model improvement eliminates the need for context budgeting — partitioning work to fit the window with room for reasoning.

“Output will remain probabilistic.” The unicode agent reported success on changes it never persisted. The same prompt, re-dispatched, might have worked. Reliability was architected through checkpoint discipline and file-state verification — not by trusting any single execution.

“Human judgment will remain the bottleneck and the differentiator.” The user’s v7 escalation — demanding all commands be covered, not just install — was the highest-leverage decision in the session. No agent suggested it. The 8 plan iterations were not wasted work; they were the mechanism through which human judgment shaped the architecture.


The evidence chain is inspectable at PR #394. The five escalations above are the full log.

Note

Canonical Metrics — PR #394
| Metric | Value | Notes |
| --- | --- | --- |
| Files changed | 75 | 47 planned + 28 dependency/test/config |
| Lines changed | +7,832/−1,074 | |
| Tests before | 2,829 | Baseline at start |
| Tests after | 2,897 | +115 written, −47 consolidated = +68 net |
| Agent dispatches | ~25 | Across 5 waves in 5 phases |
| Audit agents | 6 | Expert panel phase |
| Plan iterations | 8 | v1–v8, approved at v8 |
| Escalations | 5 | 1×L2, 2×L3, 2×L4 |
| Execution interventions | 3 | During wave execution (L3/L4 escalations) |
| Verification interventions | 2 | During review/validation |
| Wave execution time | ~90 minutes | Active agent + human review time |
| Total wall-clock time | ~16 hours | 2 sessions including planning, monitoring, breaks |
| Human time breakdown | ~30% planning, ~20% monitoring, ~25% interventions, ~25% review | Approximate, single execution |

Other chapters reference these metrics. “~90 minutes” refers to wave execution time; “~16 hours” refers to total elapsed time including planning and breaks.

In the author’s experience, similar-scope manual refactors (consolidating authentication patterns across 40+ files) have taken 3–5 days of focused engineering time. This estimate has not been formally benchmarked.


CC BY-NC-ND 4.0 © 2025–2026 Daniel Meppiel. Free to read and share with attribution.