9  The Instrumented Codebase

Open any repository that uses AI agents reliably and you’ll notice something before you read a single line of application code: the project is full of markdown files that aren’t documentation. They’re instruction sets, agent configurations, skill definitions, workflow templates, memory files, and specification blueprints, scattered through .github/ directories and woven into the source tree. These files don’t ship to production. They don’t appear in the build output. But they determine whether an AI agent produces code that respects the project’s architecture or code that confidently violates it.

This chapter catalogs those files. It defines seven primitive types, shows what each looks like, explains how they compose, and walks through the transformation of an uninstrumented repository into one that’s ready for agentic development. Chapter 10 specifies the constraints these primitives implement. Chapter 11 teaches the context engineering discipline that makes them effective. This chapter answers the prior question: what, specifically, are you building?


9.1 What Instrumentation Means

An instrumented codebase is one that has externalized its tacit knowledge into machine-readable artifacts.

Every mature project carries two kinds of knowledge. The first kind is in the code itself (types, function signatures, directory structure, test assertions). Any agent can read this. The second kind is in the team’s heads: which authentication pattern is current and which is deprecated, why the logging module uses a custom wrapper instead of the standard library, what “follows the BaseIntegrator pattern” means in practice, why that one directory has different import rules than every other. An agent cannot read this. It will guess, and it will guess wrong.

Instrumentation is the practice of converting the second kind of knowledge into structured files that an agent loads as context. The files are markdown. They version-control alongside the code they describe. They’re reviewed in pull requests. And they create a feedback loop: when an agent makes a mistake, you don’t fix the generated code; you fix the context file that failed to prevent the mistake.

The term “primitive” is deliberate. These artifacts are the atomic units of agentic behavior. Like functions in code, they do one thing, they compose with other context files, and they’re testable in isolation. Unlike prompts typed into a chat window and forgotten, these artifacts persist, accumulate value, and improve through iteration.


9.2 The Seven Primitive Types

Seven categories cover the full range of knowledge an agent needs. Each addresses a distinct gap between what’s in the code and what an agent needs to know. Not every project needs all seven on day one (the instrumentation audit later in this chapter helps you decide where to start), but understanding the complete set is necessary before making that decision.

Figure 9.1: The seven APM primitive types

  • Instructions (.instructions.md): scoped conventions per file/directory
  • Agents (.agent.md): specialist personas with tool boundaries
  • Skills (SKILL.md): reusable decision frameworks
  • Prompts (.prompt.md): repeatable workflows
  • Memory (.memory.md): cross-session knowledge
  • Orchestration (.spec.md): execution-ready specifications
  • Hooks (event-driven): automated actions on development events

9.2.1 Instructions

Purpose: Encode project conventions scoped to specific files, directories, or file types. Instructions are the most granular context artifact; they tell an agent “when you touch code in this scope, follow these rules.”

File format: .instructions.md with frontmatter specifying scope.

---
applyTo: "src/api/**"
description: "API layer conventions for endpoint implementation"
---

# API Development Rules

## Middleware Registration
- All middleware decorators are registered in `middleware.py`, never inline on routes.
- Route files define endpoint logic only.

## Rate Limiting
- Use `app.rate_limiter.RateLimiter`, not third-party libraries.
  The internal implementation integrates with the metrics pipeline.
- Rate limit values come from environment variables, never hardcoded.

## Error Responses
- All error responses use `APIError.from_exception()` for consistent format.
- Never return raw exception messages to clients.

Design test: Can you state the scope in one applyTo pattern? Does every rule in the file apply to that scope? If you’re writing rules that apply to two unrelated domains, split the file. If you can’t express the scope as a glob, the knowledge probably belongs in an agent configuration or a skill instead.

What distinguishes a good instruction file from a bad one: length. If your instruction file exceeds 40-50 lines, it’s trying to do too much. The reason is mechanical: every line of instruction competes for attention with the source code the agent needs to read. A 200-line instruction file doesn’t give an agent more to work with. It gives it more to get lost in.
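To make the scoping mechanism concrete, here is a minimal sketch of how a tool might decide which instruction files apply to the file being edited. The glob semantics are an approximation using Python’s `fnmatch`; real tools implement their own matching rules.

```python
# Sketch: matching an instruction file's applyTo glob against the file
# being edited. fnmatch is an illustrative approximation of real tools'
# glob semantics.
from fnmatch import fnmatch

def applicable_instructions(path, instruction_files):
    """Return instruction files whose applyTo pattern covers `path`."""
    return [
        name for name, apply_to in instruction_files.items()
        if fnmatch(path, apply_to)
    ]

instructions = {
    "api.instructions.md": "src/api/**",
    "frontend.instructions.md": "src/ui/**/*.tsx",
    "testing.instructions.md": "**/test/**",
}

print(applicable_instructions("src/api/users.py", instructions))
# ['api.instructions.md']
```

An edit under src/api/ loads only the API rules; the frontend and testing files stay out of context, which is exactly the attention economy the design test above protects.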

9.2.2 Agents

Purpose: Define specialist personas with domain expertise, calibrated judgment, and explicit behavioral boundaries. An agent configuration is the answer to “who should work on this?”, not in terms of a human team member, but in terms of what expertise, priorities, and constraints the task requires.

File format: .agent.md with frontmatter specifying the model, tools, and description.

---
description: "Backend architecture specialist for Python services"
tools: ["changes", "codebase", "editFiles", "runCommands",
        "search", "problems", "testFailure"]
model: claude-sonnet-4.5
---
# Python Architect

You are an expert Python architect specializing in CLI tool design
and modular service architecture.

## Design Philosophy
- Speed and simplicity over complexity
- Solid foundation that can be iterated on
- Operations proportional to affected files, not workspace size

## Patterns You Enforce
- BaseIntegrator for all file-level integrators
- CommandLogger for all CLI output
- AuthResolver for all credential access

## You Never
- Add a new base class when an existing one can be extended
- Instantiate AuthResolver per-request (it is a singleton)
- Import from integration/ in the CLI layer (use the public API)

Four elements make an agent configuration effective:

  1. Domain expertise. Specific enough to constrain the agent’s decisions. “You are an expert Python developer” is too broad to be useful. “You specialize in CLI tool design using the Click framework with Rich terminal output” constrains the solution space meaningfully.
  2. Named patterns. When the agent knows patterns by name (“BaseIntegrator,” “CredentialChain,” “CommandLogger”) it can reference them in its reasoning and produce code that uses them correctly.
  3. Anti-patterns. What the agent must never do. These encode institutional memory; each item represents a mistake that happened at least once and cost the team time to fix.
  4. Tool boundaries. Which tools the agent can invoke. A documentation agent shouldn’t execute destructive commands. A frontend agent shouldn’t access backend databases. Tool boundaries are safety boundaries made concrete.

Start with three to five agent configurations. An architect, a domain expert for your core business logic, and a documentation writer cover most tasks. Add configurations when you observe repeated corrections; that’s the signal that a new specialization has earned its place.
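Tool boundaries are the one element of an agent configuration that can be checked mechanically. The sketch below parses the `tools` list from an agent file’s frontmatter and flags anything outside a role’s allowlist; the allowlist policy itself is a hypothetical choice, not part of any tool’s native behavior.

```python
# Sketch: enforcing tool boundaries by validating an .agent.md file's
# declared tools against a per-role allowlist (the allowlist is an
# assumption for illustration).
import ast
import re

# Hypothetical allowlist for a documentation agent: read and edit, but
# never execute commands.
DOC_WRITER_TOOLS = {"changes", "codebase", "search", "problems", "editFiles"}

def declared_tools(agent_md: str) -> list[str]:
    """Extract the tools list from the frontmatter, if present."""
    match = re.search(r'^tools:\s*(\[.*?\])', agent_md, re.S | re.M)
    return ast.literal_eval(match.group(1)) if match else []

def boundary_violations(agent_md: str, allowed=DOC_WRITER_TOOLS) -> list[str]:
    return [t for t in declared_tools(agent_md) if t not in allowed]

doc_writer = """---
description: "Documentation writer"
tools: ["codebase", "search", "runCommands"]
model: claude-sonnet-4.5
---"""

print(boundary_violations(doc_writer))
# ['runCommands']
```

A check like this can run in CI, so a pull request that widens an agent’s tool access gets the same scrutiny as a code change.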

9.2.3 Skills

Purpose: Package reusable decision frameworks that activate based on code patterns. A skill is more than a set of rules; it teaches an agent how to think about a specific domain.

File format: A directory containing a SKILL.md file, optionally with examples and supporting context.

.github/skills/
└── cli-logging-ux/
    ├── SKILL.md
    └── examples/
        ├── good-warning.py
        └── bad-warning.py
---
name: cli-logging-ux
description: >
  Activate whenever code touches console helpers, DiagnosticCollector,
  STATUS_SYMBOLS, or any user-facing terminal output.
---

# CLI Logging UX

## Decision Framework

### The "So What?" Test
Every warning must answer: what should the user do about this?
A warning without a suggested action is noise.

### The Traffic Light Rule
| Color  | Helper           | Meaning            |
|--------|------------------|--------------------|
| Green  | _rich_success()  | Completed          |
| Yellow | _rich_warning()  | User action needed |
| Red    | _rich_error()    | Cannot continue    |
| Blue   | _rich_info()     | Status update      |

### The Newspaper Test
Can a user scan the output like headlines?
If they have to read paragraphs to understand status, restructure.

## Anti-Patterns
- Never use bare print() or click.echo() without styling
- Never emit a warning without an actionable suggestion
- Never mix Rich and colorama in the same output path

The design test for a skill asks three questions: Does this knowledge apply across multiple files? Does it require more than a few rules to express? Is it triggered by a detectable code pattern? If all three answers are yes, it’s a skill. If the knowledge applies to a single directory, it’s an instruction. If it’s a general disposition, it’s an agent configuration.

Skills differ from instructions in an important way: they provide decision frameworks, not just rules. A rule says “use _rich_warning() for warnings.” A decision framework says “every warning must answer ‘what should the user do about this?’” The framework generalizes to situations the author didn’t anticipate. Rules cover known cases. Frameworks cover unknown ones.
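Because skills activate on detectable code patterns, a loader can decide which skills to attach by scanning the code being edited for trigger identifiers. The sketch below uses the triggers named in the cli-logging-ux example; the second skill’s triggers are illustrative assumptions.

```python
# Sketch: pattern-based skill activation. Triggers for cli-logging-ux
# come from the example above; the api-middleware triggers are
# hypothetical.
SKILL_TRIGGERS = {
    "cli-logging-ux": ["DiagnosticCollector", "STATUS_SYMBOLS", "_rich_"],
    "api-middleware": ["@app.middleware", "RateLimiter"],
}

def active_skills(source: str) -> list[str]:
    """Return skills whose trigger identifiers appear in the source."""
    return sorted(
        skill for skill, triggers in SKILL_TRIGGERS.items()
        if any(t in source for t in triggers)
    )

snippet = "collector = DiagnosticCollector()\n_rich_warning('disk nearly full')"
print(active_skills(snippet))
# ['cli-logging-ux']
```

The point of trigger-based activation is economy: a skill costs context only on the tasks where its decision framework is relevant.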

9.2.4 Prompts

Purpose: Define reusable, parameterized workflows that orchestrate multi-step tasks. A prompt is a repeatable process, the agentic equivalent of a script or a makefile target.

File format: .prompt.md with frontmatter specifying execution mode and tools.

---
mode: agent
description: "Code review workflow with security and architecture checks"
---

# Structured Code Review

## Context Loading
1. Read the [project architecture](../../docs/architecture.md)
2. Review the diff to understand scope of changes
3. Check [security guidelines](../../docs/security.md) for relevant patterns

## Review Phases

### Phase 1: Correctness
- Does the code do what the PR description claims?
- Are edge cases handled?
- Do tests cover the new behavior?

### Phase 2: Architecture
- Does this change respect existing module boundaries?
- Are new dependencies justified and minimal?
- Would a senior engineer on this team approve the approach?

### Phase 3: Security
- Is user input validated at the boundary?
- Are credentials handled through the standard resolver?
- Could this change introduce injection, traversal, or leakage?

## Output Format
Provide findings grouped by severity (Critical / High / Medium / Low).
For each finding: file, line, what's wrong, what to do instead.

Prompts are the bridge between ad-hoc chat and systematic workflows. Without them, every time a developer wants an agent to perform a code review, they type a slightly different request and get slightly different quality. With a prompt file, the process is consistent: the same phases, the same checks, the same output format. Quality becomes reproducible.

9.2.5 Memory

Purpose: Preserve knowledge across sessions. Agents are stateless; every conversation starts from zero. Memory files give them access to accumulated decisions, resolved trade-offs, and project history that would otherwise vanish between sessions.

File format: .memory.md, structured by domain.

# Project Decisions

## Authentication (last updated: 2025-06-15)
- Migrated from session-based to JWT auth in Q1 2025
- Token refresh uses exponential backoff, max 3 retries
- EMU (Enterprise Managed User) tokens use `ghu_` prefix ← INCORRECT, agent error
- CORRECTION: EMU tokens use standard PAT prefixes (`github_pat_` / `ghp_`). `ghu_` is OAuth.
- The `SessionAuth` class is deprecated but not yet removed.
  Do NOT use it for new code. Migration tracked in JIRA-4521.

## API Versioning (last updated: 2025-05-20)
- v1 endpoints frozen. No new features, security fixes only.
- v2 is the active version. All new endpoints go here.
- Versioning is URL-based (/v1/, /v2/), not header-based.
  This was debated and decided in ADR-017.

## Performance Decisions (last updated: 2025-07-01)
- Database connection pooling uses pgbouncer, not application-level pools.
- Cache invalidation is event-driven (via message queue), not TTL-based.
- The /api/search endpoint has a 5-second timeout. This is intentional —
  longer queries must use the async search endpoint.

Memory files capture the knowledge that doesn’t fit in instructions because it isn’t a rule; it’s context. The distinction matters: “use JWT for authentication” is a rule (instruction). “We migrated from sessions to JWT in Q1, and the old SessionAuth class is still in the code but deprecated” is context (memory). An agent that knows only the rule might accidentally use the deprecated class. An agent that also has the memory won’t.

Memory files are the most likely primitive to drift from reality. Include dates. Review them quarterly. If a section hasn’t been updated in six months, verify that it’s still accurate or remove it.

Note: Memory Storage Varies by Tool

While this chapter describes memory as markdown files for portability, some tools implement memory as structured databases. GitHub Copilot, for example, stores memory in a database system rather than flat files. The principles are the same (persistence, discoverability, and staleness management) regardless of the storage mechanism.

9.2.6 Orchestration

Purpose: Bridge planning and implementation by defining structured specifications that can be executed by humans or agents with the same precision. Orchestration files decompose large features into implementation-ready units.

File format: .spec.md for specifications, or workflow composition files that coordinate multiple context files.

# Feature: Rate Limiting for Public API

## Problem Statement
Public API endpoints have no rate limiting. A single client can
exhaust server resources with sustained high-frequency requests.

## Approach
Implement per-client rate limiting using the internal RateLimiter
with Redis-backed sliding window counters.

## Implementation Requirements

### Components
- [ ] Rate limiter middleware (`src/middleware/rate_limiter.py`)
- [ ] Redis counter service (`src/services/rate_counter.py`)
- [ ] Configuration loader (`src/config/rate_limits.yaml`)

### API Contracts
- 429 response with `Retry-After` header when limit exceeded
- `X-RateLimit-Remaining` header on every response
- Per-endpoint configuration via environment variables

### Validation Criteria
- [ ] Handles concurrent requests without race conditions
- [ ] Sliding window accurate within 1-second granularity
- [ ] Unit tests > 90% coverage on counter logic
- [ ] Load test: 10K requests/second without limiter degradation

Specification files matter because they make the “Reduced Scope” constraint operational. Instead of telling an agent “implement rate limiting” and hoping it figures out the approach, the spec defines scope, components, contracts, and success criteria upfront. The agent implements against a specification, not a wish.

9.2.7 Hooks

Purpose: Define automated actions triggered by development events. Hooks bridge the gap between passive context (instructions, memory) and active behavior, making the instrumented codebase reactive rather than waiting to be queried.

File format: Configured via tool-specific hook mechanisms (e.g., VS Code tasks, GitHub Actions triggers, copilot hooks configuration) rather than a single portable file format.

Examples of what hooks automate:

  • Auto-run linting on save to catch convention violations before commit
  • Trigger test generation when a new source file is created
  • Auto-update memory files after a successful PR merge
  • Run a security-reviewer agent on every change to src/auth/

Hooks complete the instrumentation model. Without them, every instrumentation file is passive; it waits for an agent to be invoked and for the right context to load. With hooks, the context layer responds to events: a file save triggers a check, a new file triggers scaffolding, a merged PR triggers a memory update. The instrumented codebase stops being a library of reference material and starts behaving like an active participant in the development workflow.

Start with one or two hooks: a linting check on save and a test prompt on new file creation cover the highest-leverage triggers. Add more only when you observe a repeated manual step that a hook could eliminate.


9.3 Tool Support and Portability

The seven context artifact types are conceptual categories, not file format specifications. How each maps to a concrete file depends on which AI coding tool loads it. This table shows native support as of mid-2025; the landscape shifts quarterly, but the pattern is stable: every tool reads project-level markdown; the disagreement is about where it lives and what metadata it supports.1

| Primitive | GitHub Copilot (VS Code) | Cursor | Claude Code | Windsurf | OpenCode |
|---|---|---|---|---|---|
| Instructions | .instructions.md + applyTo; copilot-instructions.md | .cursor/rules/*.mdc + glob | CLAUDE.md per directory | .windsurfrules; cascade rules | .opencode/instructions.md |
| Agents | .agent.md with model + tools | Via rules and agent modes | /commands and agent configs | — | — |
| Skills | SKILL.md dirs with examples | Embed in rules with examples | Embed in CLAUDE.md sections | Embed in rules | — |
| Prompts | .prompt.md with exec mode | Rules or .cursor/prompts/ | /commands definitions | Flows (different model) | — |
| Memory | SQLite DB; .memory.md portable | Notepads; project-level rules | CLAUDE.md sections; persistent memory | In rules | — |
| Orchestration | .spec.md as context; plan mode | Context; composer plans | Context; plan mode | Loaded as context | Loaded as context |
| Hooks | Copilot hooks config; VS Code tasks | Task runners; .cursor/hooks/ | Hooks in settings; pre/post commands | — | — |

“—” means the tool has no native format for that type. The knowledge is still usable (you embed it in whatever instruction format the tool does support) but automatic activation and scoping are lost.

Note: Implementation Reality

The seven primitives describe what your codebase needs to communicate to agents, not a universal file format. Tools implement these concepts differently:

  • Instructions have the broadest native support — .instructions.md (Copilot), .cursor/rules/ (Cursor), CLAUDE.md (Claude Code). Every major tool reads project-level markdown from predictable locations.
  • Memory is implemented as SQLite databases (GitHub Copilot), notepads (Cursor), or embedded in CLAUDE.md sections (Claude Code). The .memory.md format described here works as portable context loaded by any tool.
  • Orchestration specs (.spec.md) map to plan mode artifacts (plan.md in Copilot CLI) or are consumed as plain context files. No tool has a native “spec” file format yet.
  • Agents, Skills, and Prompts have, at the time of writing, the deepest native support in GitHub Copilot’s .agent.md, SKILL.md, and .prompt.md formats. Hooks are natively supported by Copilot, Cursor, and Claude Code; other tools achieve similar outcomes through their own configuration mechanisms.

The file names in this chapter are conventions that work today. As tools converge, expect native format support to broaden.

Three observations.

Instructions are the universal context file. Every major tool reads markdown from predictable locations and applies it as context. File naming and scoping differ (applyTo frontmatter, glob-based rule files, per-directory placement) but the underlying concept transfers without loss. This is where to invest first, regardless of tooling.

Agent configurations are the least portable. Chat modes, model selection, and tool boundaries are defined differently in every tool and don’t transfer. The knowledge inside them (domain expertise, named patterns, anti-patterns) is just markdown and moves freely. The activation mechanism does not.

The three major tools now support most primitives natively. GitHub Copilot, Cursor, and Claude Code each support the majority of the seven primitive types, though file formats and native integration depth differ. As of this writing, Copilot has the most extensive native format support, though the gap is narrowing rapidly as Cursor and Claude Code add primitive-equivalent features. OpenCode, the newest entrant, currently supports mainly instructions. Windsurf covers instructions and basic context loading. The methodology described here is portable across all of them — the knowledge transfers even when the wiring differs.

9.3.1 What happens when you switch tools

If your team uses one tool, optimize for its native formats. If you use multiple tools or expect to switch, organize your instrumentation files in two tiers:

Portable tier (works everywhere with minor adaptation):

  • Instruction content — the rules themselves, as markdown prose
  • Memory files — decisions, deprecations, historical context
  • Orchestration specs — requirements, contracts, validation criteria
  • Skill knowledge — decision frameworks, anti-patterns, examples

Tool-specific tier (requires per-tool configuration):

  • Instruction scoping — how rules get matched to files (applyTo vs. glob frontmatter vs. directory placement)
  • Agent configurations — model selection, tool boundaries, persona activation
  • Prompt execution — how workflows are triggered and parameterized

The portable tier is the knowledge itself, roughly 80% of the value. The tool-specific tier is the wiring that connects knowledge to a particular editor. When you switch tools, you rewrite the wiring. That’s a few hours of adaptation, not a rewrite of what your team knows.

For teams working across multiple tools, the translation between formats can be handled manually by maintaining parallel directory structures, or automated with emerging tooling in this space.


9.4 Directory Structure

The seven primitive types organize into a predictable directory structure. The layout below follows GitHub Copilot conventions, the most complete native implementation of all seven types. Other tools use different file locations (see the compatibility table above), but the organizational principle holds: centralize context files in a known location, separate by type, keep each type flat.

project/
├── .github/
│   ├── copilot-instructions.md          # Global project principles
│   ├── instructions/
│   │   ├── api.instructions.md          # applyTo: "src/api/**"
│   │   ├── auth.instructions.md         # applyTo: "src/auth/**"
│   │   ├── frontend.instructions.md     # applyTo: "src/ui/**/*.tsx"
│   │   ├── testing.instructions.md      # applyTo: "**/test/**"
│   │   └── database.instructions.md     # applyTo: "src/db/**"
│   ├── agents/
│   │   ├── architect.agent.md        # Structure, patterns, trade-offs
│   │   ├── backend-dev.agent.md      # API implementation, business logic
│   │   ├── security-reviewer.agent.md # Injection, traversal, credentials
│   │   └── doc-writer.agent.md       # Documentation consistency
│   ├── skills/
│   │   ├── cli-logging-ux/
│   │   │   ├── SKILL.md
│   │   │   └── examples/
│   │   ├── error-handling/
│   │   │   └── SKILL.md
│   │   └── api-middleware/
│   │       └── SKILL.md
│   ├── prompts/
│   │   ├── code-review.prompt.md
│   │   ├── feature-impl.prompt.md
│   │   └── bug-investigation.prompt.md
│   └── specs/
│       ├── feature-template.spec.md
│       └── api-endpoint.spec.md
├── .memory.md                            # Project-level memory
├── AGENTS.md                             # Root discovery file
├── src/
│   ├── api/
│   │   └── AGENTS.md                    # API-specific context
│   ├── auth/
│   │   └── AGENTS.md                    # Auth-specific context
│   └── ...
└── ...

Three observations about this structure.

Context files live in .github/, not scattered through the source tree. This centralizes the knowledge layer; a developer looking for the project’s AI configuration finds it in one place. The exceptions are AGENTS.md files, which live in the directories they describe, because they’re part of a discovery hierarchy (Chapter 11 explains this in detail).

Each primitive type has its own directory. Instructions, agents, skills, prompts, specs, and hooks don’t mix. This makes it straightforward to audit what you have: how many instruction files exist, what domains they cover, which skills are defined. Mixed directories make this accounting harder than it needs to be.

The structure is flat within each directory. Resist the urge to create nested hierarchies. If you have 15 instruction files, a flat list with descriptive names is easier to scan than a three-level tree. If you have 50, you’re probably over-engineering; most projects need 8-12 instruction files.


9.5 How Primitives Compose

Primitives are not independent. They form a layered system where each type provides a different kind of guidance, and the agent’s effective context is the composition of all applicable context files for the current task.

The composition follows a hierarchy:

Global principles (copilot-instructions.md)
  └─ Scoped instructions (*.instructions.md, matched by applyTo)
      └─ Skills (activated by code patterns in the current task)
          └─ Agent configuration (persona, model, tool boundaries)
              └─ Prompt or spec (the specific workflow being executed)
                  └─ Memory (accumulated project context)
                      └─ Hooks (event-driven triggers, operating across all layers)

When an agent is asked to modify src/api/users.py, the effective context assembles from:

  1. Global principles — error handling, security, testing rules that apply everywhere
  2. API instructions — the applyTo: "src/api/**" file loads; frontend instructions do not
  3. API middleware skill — activates because the task involves an API route
  4. Backend-dev agent — provides the persona, model selection, and tool constraints
  5. Memory — the API versioning decisions, the deprecated authentication class, the rate limit timeout choice

Each layer adds specificity. None contradicts the layer above; more specific context files refine general guidance, they don’t override it. If a conflict exists, it indicates a design error in the instrumentation, not a resolution the agent should attempt.

This composition is the Explicit Hierarchy constraint made concrete. Global rules provide consistency. Scoped rules provide domain adaptation. Skills provide decision frameworks. Agent configurations provide expertise. Prompts and specs provide task structure. Memory provides historical context. Hooks provide event-driven automation that ties the layers together. Together, they give the agent the same information a tenured team member would bring to the task, without requiring that information to fit in anyone’s head.2


9.6 The Instrumentation Audit

Before you build any of this, you need to know what your codebase already has and what it’s missing. The instrumentation audit is a systematic inventory, not of your code, but of the knowledge your code depends on.

Step 1: List your conventions. Spend 30 minutes with your team. Write down every convention, pattern, and constraint that a new engineer would need to learn in their first two weeks. Don’t filter. Don’t organize.

You’ll typically get 30-60 items. Examples:

  • All error responses use the APIError class
  • Token refresh has a 3-retry limit with exponential backoff
  • The SessionAuth class is deprecated; use JWTAuth
  • Tests use factory functions, never inline object construction
  • Frontend components use the project’s design system, never raw HTML elements
  • Database migrations are reviewed by the DBA before merge
  • The logging module wraps Rich; never call print() directly

Step 2: Classify each item. For every convention, mark where it lives today:

| Location | Meaning | Agent visibility |
|---|---|---|
| In code | Expressed in types, naming, structure | Partially visible — if it’s in the context window |
| In docs | Written in a README, wiki, ADR, style guide | Invisible unless explicitly loaded |
| In heads | Known by team members, never written down | Completely invisible |

The “in heads” column is your instrumentation debt. Every item there is a convention an agent will violate because it has no way to know about it.

Step 3: Rank by failure cost.

  • Critical: Security vulnerabilities, data corruption, production outages
  • High: Architectural violations that accumulate as technical debt
  • Medium: Convention violations that require rework in code review
  • Low: Style preferences that don’t affect correctness

Step 4: Map each item to a primitive type.

| If the knowledge… | It belongs in… |
|---|---|
| Is a rule scoped to specific files/directories | An instruction file |
| Requires specialist expertise or a specific model | An agent configuration |
| Applies across files and needs a decision framework | A skill |
| Defines a repeatable multi-step process | A prompt |
| Records a decision, trade-off, or historical fact | A memory file |
| Specifies a feature with components and success criteria | A specification |
| Defines an automated response to a development event | A hook |

Step 5: Write your starter set. Begin with 3-5 context files covering your critical items. Don’t aim for completeness; the feedback loop will guide you to what’s actually needed faster than upfront planning will.


9.7 Before and After

Consider a mid-size backend service: 80,000 lines of Python, a REST API, a message queue consumer, an authentication module with some technical debt, and a CLI for operations tasks. The team has five engineers who’ve been working on it for two years.

9.7.1 Before: uninstrumented

project/
├── .github/
│   └── workflows/
│       └── ci.yml
├── README.md
├── src/
│   ├── api/
│   ├── auth/
│   ├── workers/
│   ├── cli/
│   └── models/
├── tests/
├── docs/
│   └── architecture.md
└── pyproject.toml

What happens when an agent is asked to “add a health check endpoint”:

  • It creates a new route in the API directory, using Flask patterns it learned from training data, but this project uses FastAPI
  • It imports a database connection check, but uses a raw connection instead of the project’s HealthChecker service
  • It returns a plain JSON response, ignoring the project’s standard response envelope ({"status": ..., "data": ..., "meta": ...})
  • It writes a test that passes, using inline object construction, violating the factory pattern the team uses everywhere else
  • It doesn’t register the route in the middleware pipeline because it doesn’t know the pipeline exists

Everything compiles. Tests pass. The PR gets three review comments, all of the form “we don’t do it that way here.” The reviewer rewrites 60% of the code. The agent saved time on a first draft and cost time on corrections.

9.7.2 After: instrumented

project/
├── .github/
│   ├── copilot-instructions.md
│   ├── instructions/
│   │   ├── api.instructions.md
│   │   ├── auth.instructions.md
│   │   ├── testing.instructions.md
│   │   └── cli.instructions.md
│   ├── agents/
│   │   ├── backend-dev.agent.md
│   │   └── doc-writer.agent.md
│   ├── skills/
│   │   └── api-middleware/
│   │       └── SKILL.md
│   ├── prompts/
│   │   └── new-endpoint.prompt.md
│   └── workflows/
│       └── ci.yml
├── .memory.md
├── AGENTS.md
├── README.md
├── src/
│   ├── api/
│   │   └── AGENTS.md
│   ├── auth/
│   │   └── AGENTS.md
│   ├── workers/
│   ├── cli/
│   └── models/
├── tests/
├── docs/
│   └── architecture.md
└── pyproject.toml

Same task. The agent now loads:

  1. Global instructions — error handling patterns, security rules, test requirements
  2. API instructions — FastAPI conventions, standard response envelope, route registration
  3. API middleware skill — the registration pipeline, middleware ordering
  4. Backend-dev agent — knows this is a FastAPI project, knows the service patterns
  5. New-endpoint prompt — step-by-step: check existing health patterns, register route, write factory-based test, update middleware

What it produces:

  • A FastAPI route using the project’s standard response envelope
  • A health check that delegates to HealthChecker, which already knows how to verify database, cache, and queue connectivity
  • Registration in the middleware pipeline via the standard pattern
  • A test using the project’s factory functions, following the naming convention, respecting the fixture hierarchy
  • No review comments about conventions, because the conventions were in the context
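
The shape of that output can be sketched in a few lines of plain Python. The HealthChecker name and the envelope keys come from this chapter’s scenario; the rest is hypothetical scaffolding, with the actual FastAPI wiring omitted:

```python
def make_envelope(data, status="ok", meta=None):
    """Wrap a payload in the project's standard response envelope."""
    return {"status": status, "data": data, "meta": meta or {}}


class HealthChecker:
    """Stand-in for the project service that already knows how to
    verify database, cache, and queue connectivity."""

    def check_all(self):
        return {"database": "up", "cache": "up", "queue": "up"}


def health_endpoint(checker: HealthChecker):
    # Delegate to the existing service instead of opening raw
    # connections, and return the enveloped payload the API
    # conventions require.
    return make_envelope(checker.check_all())
```

The point is the delegation: the endpoint calls the existing service and wraps the result, rather than opening raw connections or inventing its own response shape.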

The difference is not the model. In the case documented in this book, the difference is 150 lines of markdown distributed across 8 files.

9.7.3 What the numbers look like

In the author’s experience, based on projects that have undergone this transformation, including the reference case documented throughout this book:

| Metric | Uninstrumented | Instrumented |
|---|---|---|
| Convention-violating outputs | 40-60% of generated code | Under 10% |
| Review comments per agent PR | 4-8 (“we don’t do it that way”) | 0-2 (substantive, not stylistic) |
| Agent-generated code requiring rewrite | 30-50% | Under 15% |
| Time from agent output to merge | Hours (review + rework) | Minutes (spot-check) |

These numbers are directional, not guaranteed. They depend on the quality of your context files and the complexity of your conventions. But the direction is consistent: instrumented codebases produce dramatically more reliable agent output than uninstrumented ones, with the same models, the same tools, and the same tasks.3

Based on the author’s experience across instrumented projects. Results will vary by codebase complexity, model, and instrumentation maturity.


9.8 The Feedback Loop

Instrumentation is not a one-time setup. It is a continuous practice, like testing.

When an agent produces incorrect output, the diagnosis follows a consistent pattern:

Failure observed
    |
    v
Root cause: which context file failed?
    |
    +-- Agent too generic?          --> Add domain knowledge to agent config
    +-- Skill rules incomplete?     --> Add the missing case to the skill
    +-- Instructions missing scope? --> Add a scoped instruction file
    +-- No decision framework?      --> Extract a new skill
    +-- Context gap?                --> Update the memory file
    +-- No repeatable process?      --> Create a prompt

Four examples from a real project, where fixing the instrumentation file fixed the class of error permanently:

| Failure | Root cause | Context fix |
|---|---|---|
| Agent used _rich_info() directly instead of logger.progress() | Skill didn’t explicitly ban direct calls | Added “never call _rich_* directly in commands” to CLI skill |
| Agent invented a new collision detection pattern | Instructions didn’t list all base-class methods | Added “use, don’t reimplement” table to integrator instructions |
| Agent produced inconsistent Unicode symbols in output | No single source of truth for status symbols | Created STATUS_SYMBOLS reference in skill, added to anti-patterns |
| Agent used deprecated SessionAuth in new code | Memory file didn’t record the deprecation | Added deprecation notice with migration tracking reference |

This feedback loop is how context file quality compounds over time. Every failure you diagnose and fix is a failure that never recurs. After 20-30 iterations, your instrumentation set covers the conventions that actually matter, not the ones you theorized about, but the ones agents actually violate. That practical grounding is what makes an instrumented codebase effective.


9.9 What It Looks Like: An Annotated Session

The following excerpts are from a real Copilot CLI session — not a reconstruction, not a sanitized demo. They are lightly trimmed for length but not edited for voice. The typos and urgency are part of the record.

Note: Session d89b3ccc

Repository: microsoft/apm · Branch: feat/auth-logging-architecture-393 · Result: PR #394 — 75 files changed · Duration: 207 turns across 2 sessions

Turn 0 — The messy start. The session opens with the developer pasting raw terminal output of an authentication failure and role-casting the agent in a single breath. This is the Architect role from Chapter 8 — not writing a plan document, but framing the problem and the expertise needed to solve it:

“So I try to install a package from a GitHub EMU organization (requires auth) and it goes as: apm install mcaps-microsoft/mcaps-software-poya#v1.0.0 Validating 1 package(s)… ✗ mcaps-microsoft/mcaps-software-poya#v1.0.0 - not accessible or doesn’t exist […] Can you think as a world class UX expert and work with an APM logging mechanism implementation expert on what is going on here and how to fix the experience as a whole”

No YAML plan. No schema. A bug report, a role assignment, and a quality bar — all in natural language.

Turn 3 — Fleet deployment in six words. Once the agent drafted a plan, the developer’s entire orchestration command was:

“Fleet deployed: create an issue to track this Epic and then implement on a new branch”

The agent responded with 17 of 25 planned tasks completed, a GitHub issue created, and a feature branch pushed. Contrast this with the fabricated copilot-cli dispatch --scope commands that populate synthetic demos. Real orchestration is natural language with clear intent, not CLI flags.

Turns 4–5 — The feedback loop catches a silent semantic failure. The agent’s initial plan assumed EMU (Enterprise Managed Users) meant *.ghe.com hosts — a reasonable inference from the documentation, and one no test would flag. The developer caught it:

“Note that in the docs the doc writer seems to think EMU = *.ghe.com. But that’s not the only case. A github.com instance may be an EMU. *.ghe.com is for the Data Resident EMU offering. Hopefully this does not reflect faulty auth code and cases.”

One turn later, the correction sharpened into an instrumentation fix — not just correcting the code, but ordering the context file updated:

“Sorry but I do not think ‘ghu_’ is identifying EMUs. I have a ‘github_pat_’ starting token and it comes from an EMU, it’s fine grained. It’s rather ghp (classic PAT) and github_pat (fine grained pat) prefixes. ‘ghu_’ is for OAuth User Tokens. This is critical knowledge for the auth expert agent.md to be updated.”

That last sentence is this chapter’s entire thesis in twelve words. The developer didn’t just fix the code — they fixed the context file so every future agent inherits the correction. This is Anti-Pattern #6 from Chapter 14 (Silent Semantic Failure) caught and resolved at the primitive layer.

Turn 23 — Multi-agent debugging with real output. After initial fixes, auth still failed. The developer invoked the agent team and pasted verbose terminal output showing the full authentication resolution chain:

“It’s better but auth still does not work. Please work together with the agent team (auth expert, logging expert, doc expert) find the rca”

The agent traced the failure through real debug logs: Auth resolved → Trying unauthenticated → failed → retrying with token → 403 → trying credential fill → failed. Root cause: the x-access-token:{token}@host URL format sends Basic auth, which GitHub rejects with a 403 for fine-grained PATs. The fix was to switch to git -c http.extraHeader='Authorization: Bearer {token}'. Verbose instrumentation made the failure chain legible. Without it, the 403 would have been a mystery.
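
The two credential formats differ on the wire, which is why one fails and the other succeeds. The following sketch is illustrative only: it builds the header strings rather than issuing real git requests, and the rejection behavior is as reported in the session above:

```python
import base64


def basic_auth_header(token: str) -> str:
    # The x-access-token:{token}@host URL form makes git send HTTP Basic
    # auth: "x-access-token" as the username and the token as the
    # password, base64-encoded together.
    creds = base64.b64encode(f"x-access-token:{token}".encode()).decode()
    return f"Authorization: Basic {creds}"


def bearer_auth_header(token: str) -> str:
    # The http.extraHeader fix sends the token as a Bearer credential
    # instead, which is the form fine-grained PATs need.
    return f"Authorization: Bearer {token}"
```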

Turn 50 — Escalation to named agents. When single-agent debugging hit diminishing returns, the developer restructured the working group:

“considering the implementation you’ve already started, and realizing you’ve been hitting roadblocks, I want you now to work in plan mode with our custom .agent.md software architect and the logging expert with the related skill. These are 2 subagents that will work with you on the planning task.”

This is instrumentation at the orchestration layer. The developer didn’t write more code or more tests — they changed which agents were in the room and how they coordinated. The custom .agent.md files (described below) gave each agent persistent domain knowledge that outlives the session.

Turn 71 — Granular review, then delegation. The developer shifted into the Reviewer role (Chapter 8) at its most specific, then handed off to specialists:

“you are printing the hashes twice? […] And if we are in verbose mode, why aren’t you printing live in the logs the 22 files we skipped? […] I’ll let the logging expert and UX expert on world class logs plan and decide. This expert agent needs to look at each of these commands with a team of clone subagents working for him and assess whether this is really killer full verbose logs”

The pattern: review at the level of individual log lines, then delegate the fix to an agent whose .agent.md defines the quality bar. The developer sets the standard; the agents implement it.

Turn 73 — The instrumentation payoff. This is the moment that distinguishes a good session from a lasting one:

“you must ensure you update the logging skill so that such architectural concerns and patterns are enshrined — not as unmovable, but as current art and baseline”

Not a code fix. A primitive fix. The developer updated the logging skill file so that every future agent working on logging — in this session, in next month’s session, by a different developer — inherits the patterns discovered here. One fix, permanent prevention. This is the feedback loop described earlier in this chapter, operating at production scale.

What the agent files look like. The .agent.md files referenced throughout this session are not abstractions. Here is the auth expert, created during the session and refined across turns 4–5:

---
name: auth-expert
description: >-
  Expert on GitHub authentication, EMU, GHE, ADO, and APM's
  AuthResolver architecture. Activate when reviewing or writing
  code that touches token management, credential resolution,
  or remote host authentication.
model: claude-opus-4.6
---
# Auth Expert

You are an expert on Git hosting authentication across
GitHub.com, GitHub Enterprise (*.ghe.com, GHES), Azure DevOps,
and generic Git hosts.

## Core Knowledge

- **Token prefixes**: Fine-grained PATs (`github_pat_`),
  classic PATs (`ghp_`), OAuth user tokens (`ghu_`)...
- **EMU**: Use standard PAT prefixes. No special prefix —
  it's a property of the account, not the token.
- **Host classification**: github.com (public), *.ghe.com
  (no public repos), GHES, ADO

This is what “update the auth expert agent.md” means in practice. The domain correction from Turn 5 — that EMU tokens use standard PAT prefixes, not a special ghu_ prefix — lives here permanently. Every agent that activates this file inherits that knowledge without being told twice.
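
The corrected prefix knowledge is mechanical enough to express as a lookup. This helper is hypothetical, not from the APM codebase, but it encodes exactly the Turn 5 correction: prefixes identify token type, and EMU is a property of the account, never of the token:

```python
# Hypothetical helper encoding the Turn 5 correction. Order matters:
# "github_pat_" must be checked before shorter prefixes.
TOKEN_PREFIXES = {
    "github_pat_": "fine-grained PAT",
    "ghp_": "classic PAT",
    "ghu_": "OAuth user token",
}


def classify_token(token: str) -> str:
    """Classify a GitHub token by its prefix; EMU status is not
    derivable from the token itself."""
    for prefix, kind in TOKEN_PREFIXES.items():
        if token.startswith(prefix):
            return kind
    return "unknown"
```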

Tip: Turn 52 — The meta-moment

Midway through the session, the developer paused and recognized the pattern:

“Look at the planning and agent teams orchestration pattern we used, with waves, with task dependencies, with checkpoints that include panel discussions […] with the specific skills and custom agent files used to instantiate the different flavor of specialized subagents […] This should become a handbook for AI Engineers to leverage.”

This book was partly born from the patterns it describes. By Turn 79, the developer applied the same orchestration structure — specialized agents, panel review, escalation gates — to set up a documentation team for this handbook itself. The pattern is fractal: it works for code, for knowledge work, for any domain where quality requires multiple perspectives and persistent context.

Three things to notice.

The developer’s value was domain knowledge, not keystrokes. Across 207 turns and 75 changed files, the developer wrote no application code. Their contributions were the problem frame (Turn 0), the domain corrections (Turns 4–5), the team composition decisions (Turns 50, 71), and the context file fixes (Turns 5, 73). This is the role shift Chapter 8 described: architect, reviewer, escalation handler — operating through instrumentation rather than implementation.

Every correction became a permanent artifact. A developer without instrumentation would have fixed the auth code and moved on. The same EMU confusion would recur next month, with a different agent, on a different task. Instead, the correction went into the auth-expert.agent.md (Turn 5) and the logging skill (Turn 73) — context files that prevent the class of error, not just the instance. This is the feedback loop that makes instrumentation compound over time.

The hardest bugs were semantic, not syntactic. The agent’s assumption that EMU means *.ghe.com was plausible, consistent with documentation, and wrong. No linter catches that. No test catches it unless the test author already knows the correct answer — in which case you don’t need the test. This is Anti-Pattern #6 from Chapter 14: Silent Semantic Failure. The safety net was a human with domain expertise reviewing agent output with enough context (verbose logs, Turn 23) to see where the reasoning broke. Agent-augmented development does not eliminate the need for human judgment. It changes where that judgment is applied — and instrumentation determines whether it has lasting effect.


9.10 Starting Points

The seven primitives and the full directory structure represent a mature instrumented codebase. You don’t need to build all of it at once. Start where the leverage is highest.

Week one. Write three files:

  1. Global instructions (10-15 lines) — your non-negotiable principles
  2. One scoped instruction file for your most-edited module
  3. One agent configuration for the task agents perform most often
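
For scale, here is a hypothetical global instructions file of that size. Every rule is illustrative, drawn from the failures in the before/after scenario earlier in this chapter, and file names like tests/factories.py are placeholders:

```markdown
# Copilot Instructions

- Use FastAPI, never Flask, for HTTP endpoints.
- Wrap every API response in the standard envelope:
  {"status": ..., "data": ..., "meta": ...}.
- Build test objects through the factories in tests/factories.py;
  never construct models inline in tests.
- Register every new route in the middleware pipeline.
- Prefer the project's service wrappers (e.g. HealthChecker) over
  raw database or cache connections.
```

Ten lines of non-negotiable principles is enough to start; the scoped instruction files carry the module-specific detail.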

Week two. Use these files on real work. When the agent violates a convention not covered by your files, add it. When it does something right that surprised you, check whether your instrumentation contributed. Update the memory file with the decisions and trade-offs you resolved this week.

Week three. Extract the first skill. You’ll know it’s time because you’ve written the same guidance in two different instruction files. Package the shared knowledge as a skill with a decision framework. Create your first prompt file for a task you’ve now asked an agent to do three or more times.

Ongoing. Review context files monthly. Remove rules that never trigger. Tighten rules that trigger but don’t prevent the failure. Add new rules only in response to observed failures. Treat your instrumentation like a test suite; it should grow with the codebase, stay accurate, and never contain dead rules.

The instrumented codebase is not a finished state. It’s a practice, an ongoing investment in making machine-readable what your team already knows. Chapter 10 defines the constraints that make these context artifacts effective. Chapter 11 teaches the context engineering discipline that determines how and when they load. This chapter showed you the artifacts themselves.

The directory structure and primitive types in this chapter can be built by hand, and for a first project, doing so builds understanding. For subsequent projects, or for teams standardizing across repositories, the mechanical work of scaffolding and sharing benefits from a distribution mechanism: the same pattern npm brought to JavaScript modules. This category of tooling is new. One open-source implementation is APM (Agent Package Manager), built by the author of this book. You can build everything in this chapter without it; tooling merely reduces the scaffolding work from hours to seconds.

Shortcut: scaffolding with a package manager

apm init generates copilot-instructions.md, one scoped instruction file, and one agent configuration: the Week One starter set from this chapter. apm install pulls shared context files from any Git repository into your project.

Note: From primitives to frameworks

The primitives in this chapter — instructions, agents, skills, prompts, memory — are the library layer of the agentic computing stack introduced in Chapter 4. Above them, framework-layer tooling is beginning to emerge: opinionated systems that compose these primitives into complete workflows.

Two early examples. Squad (github.com/bradygaster/squad) provides repository-native multi-agent orchestration built on GitHub Copilot, consuming instruction and agent primitives to dispatch specialized agents in coordinated workflows4. Spec-Kit (github.com/github/spec-kit) takes spec-driven development further — specifications become executable artifacts rather than passive context5.

Convergence is also visible across vendors. Claude’s plugin.json manifest independently arrived at the same primitive-bundling pattern as APM: a declarative package that bundles skills, agents, hooks, and MCP servers into versioned, distributable units with namespaced composition and marketplace distribution6. When independent implementations from different vendors converge on the same structure, the underlying abstraction is sound.

An honest hedge: these tools are early. The pattern — frameworks that consume standardized primitives — is durable. The specific tools will evolve, merge, or be replaced. Your investment should be in the primitives themselves (this chapter), not in any single framework. If the primitives are well-structured, framework migration is a packaging change, not a rewrite.

Now you know what to build. What follows is how to make it work.


  1. This comparison reflects mid-2025 capabilities. The AI coding tool landscape evolves quarterly; verify current capabilities at each vendor’s documentation site.↩︎

  2. The principle of scoped, hierarchical configuration mirrors established patterns in software engineering — from CSS specificity to Git’s nested .gitignore files. The key insight is that agents benefit from the same locality-of-reference that makes these patterns effective for humans.↩︎

  3. Static analysis tools (ESLint, Pylint, Ruff) catch syntactic violations but not semantic ones — an agent can pass all linters while violating domain conventions. The instruction files complement, rather than replace, automated linting. See also the discussion of Silent Failures in Chapter 14.↩︎

  4. Brady Gaster, “How Squad Runs Coordinated AI Agents Inside Your Repository,” GitHub Blog, March 2026. https://github.blog/ai-and-ml/github-copilot/how-squad-runs-coordinated-ai-agents-inside-your-repository/↩︎

  5. GitHub, “Spec Kit — Build High-Quality Software Faster,” https://github.com/github/spec-kit↩︎

  6. Anthropic, “Claude Code Plugins,” https://docs.anthropic.com/en/docs/claude-code/plugins↩︎


© 2025-2026 Daniel Meppiel · CC BY-NC-ND 4.0

Free to read and share with attribution.