block-beta
columns 4
A["Instructions<br/>.instructions.md<br/>─<br/>Scoped conventions<br/>per file/directory"]:1
B["Agents<br/>.agent.md<br/>─<br/>Specialist personas<br/>with tool boundaries"]:1
C["Skills<br/>SKILL.md<br/>─<br/>Reusable decision<br/>frameworks"]:1
D["Prompts<br/>.prompt.md<br/>─<br/>Repeatable<br/>workflows"]:1
E["Memory<br/>.memory.md<br/>─<br/>Cross-session<br/>knowledge"]:1
F["Orchestration<br/>.spec.md<br/>─<br/>Execution-ready<br/>specifications"]:1
G["Hooks<br/>event-driven<br/>─<br/>Automated actions<br/>on dev events"]:1
10 The Instrumented Codebase
Chapter 9 named the four parts of the agentic runtime machine — Model, harness, agent source code, client — and identified the agent source code as the part the developer programs against. Maya’s port from Copilot to Claude Code failed at exactly that layer: same files, same model family, different compiler reading them. This chapter is the catalogue of files you put on disk for the compiler to read. Each entry names what the file does and which load mode it triggers — eager preload, lazy on-demand, dispatcher-mediated, user-invoked, event-driven — vocabulary Chapter 12 develops in mechanical detail.
A repository instrumented for agentic work is full of markdown that isn’t documentation. The files are parsed, linked, loaded, and executed — Chapter 9’s four properties of a code artifact. They version-control alongside the application code they steer, review in pull requests, and create a feedback loop: when an agent misbehaves, you fix the file that should have prevented the mistake, not the generated code. Seven primitive types cover the working set, each addressing a distinct gap between what is in the source and what the agent needs to know.
The word primitive is deliberate. These artifacts are the atomic units of agentic behavior. Like functions in code, they do one thing, they compose, and they’re testable in isolation. Unlike prompts typed into a chat window and forgotten, they persist, accumulate value, and improve through iteration. What makes a primitive useful — the prose discipline that fits inside it — is Chapter 11. What makes a primitive bind — when the harness loads it and against which budget — is Chapter 12. What it costs once loaded is Chapter 13. This chapter is the prior question: what, specifically, are you building?
10.1 Two kinds of knowledge
Every mature project carries two kinds of knowledge. The first lives in the code itself — types, function signatures, directory structure, test assertions. Any agent can read this. The second lives in the team’s heads: which authentication pattern is current and which is deprecated, why the logging module wraps the standard library, what “follows the BaseIntegrator pattern” means in practice, why one directory has different import rules than every other. An agent cannot read this. It will guess, and it will guess wrong.
Instrumentation is the practice of converting the second kind into structured files the harness loads as context. The catalogue that follows is the working vocabulary for that practice.
10.2 The Seven Primitive Types
Seven categories cover the full range of knowledge an agent needs. Each addresses a distinct gap between what’s in the code and what an agent needs to know. Not every project needs all seven on day one (the instrumentation audit later in this chapter helps you decide where to start) but understanding the complete set is necessary before making that decision.
Each primitive carries an implicit load mode — the moment the harness decides to read it into the model’s context. There are five modes in play, and naming them up front lets the catalogue read as a typed system rather than a list of file extensions. Three modes are deterministic: eager preload (the harness reads the file at session start, scoped or unscoped); dispatcher-mediated (the harness routes a thread to a specialist via a delegation tool); event-driven (a client process — a workflow runner, a webhook receiver, a scheduler — invokes the harness against the file in response to an outside event). Two are agent- or user-mediated: lazy on-demand (the agent itself decides to load the file based on a description match); user-invoked (the developer runs the file as a workflow). Chapter 12 covers the mechanics of each — token budgets, transitive closure, when bindings silently fail. For now it is enough to know which mode each primitive triggers.
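The five modes and their default bindings can be sketched as a small enum. This is a discussion aid, not a real harness API; the names are assumptions made for illustration.

```python
# Hypothetical sketch: the five load modes as an enum, plus the default
# binding for each of the seven primitive types described in this chapter.
# No harness exposes such an enum directly; this only makes the typed-system
# framing concrete.
from enum import Enum, auto

class LoadMode(Enum):
    EAGER_PRELOAD = auto()        # read at session start, scoped or unscoped
    LAZY_ON_DEMAND = auto()       # agent pulls the body when description matches
    DISPATCHER_MEDIATED = auto()  # harness routes a thread via a delegation tool
    USER_INVOKED = auto()         # developer runs the file as a workflow
    EVENT_DRIVEN = auto()         # client process fires on an outside event

PRIMITIVE_LOAD_MODE = {
    "instructions": LoadMode.EAGER_PRELOAD,
    "agents": LoadMode.DISPATCHER_MEDIATED,
    "skills": LoadMode.LAZY_ON_DEMAND,
    "prompts": LoadMode.USER_INVOKED,
    "memory": LoadMode.EAGER_PRELOAD,
    "orchestration": LoadMode.USER_INVOKED,
    "hooks": LoadMode.EVENT_DRIVEN,
}
```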
10.2.1 Instructions
Load mode: eager preload, scoped by applyTo glob. The harness reads matching instruction files into context whenever a thread touches a path the glob covers — Maya’s api.instructions.md from Chapter 9 is the canonical example.
Purpose: Encode project conventions scoped to specific files, directories, or file types. Instructions are the most granular context artifact; they tell an agent “when you touch code in this scope, follow these rules.”
File format: .instructions.md with frontmatter specifying scope.
---
applyTo: "src/api/**"
description: "API layer conventions for endpoint implementation"
---
# API Development Rules
## Middleware Registration
- All middleware decorators are registered in `middleware.py`, never inline on routes.
- Route files define endpoint logic only.
## Rate Limiting
- Use `app.rate_limiter.RateLimiter`, not third-party libraries.
The internal implementation integrates with the metrics pipeline.
- Rate limit values come from environment variables, never hardcoded.
## Error Responses
- All error responses use `APIError.from_exception()` for consistent format.
- Never return raw exception messages to clients.

Design test: Can you state the scope in one applyTo pattern? Does every rule in the file apply to that scope? If you’re writing rules that apply to two unrelated domains, split the file. If you can’t express the scope as a glob, the knowledge probably belongs in an agent configuration or a skill instead.
What distinguishes a good instruction file from a bad one: length. If your instruction file exceeds 40-50 lines, it’s trying to do too much. The reason is mechanical: every line of instruction competes for attention with the source code the agent needs to read. A 200-line instruction file doesn’t give an agent more to work with. It gives it more to get lost in.
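The 40-50 line budget is easy to audit mechanically. The following is a minimal sketch of such an audit helper; the function name and the 50-line default are assumptions, not part of any harness.

```python
# Hypothetical audit helper: flag instruction files that exceed the
# 40-50 line budget argued for above. The limit is a convention, not
# a harness-enforced rule.
from pathlib import Path

def oversized_instructions(root: str, limit: int = 50) -> list[str]:
    """Return paths of instruction files whose line count exceeds the limit."""
    flagged = []
    for f in sorted(Path(root).rglob("*.instructions.md")):
        if len(f.read_text().splitlines()) > limit:
            flagged.append(str(f))
    return flagged
```

Run it over your context directory as part of a periodic review; every flagged file is a candidate for splitting by scope.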
10.2.2 Agents
Load mode: dispatcher-mediated. The persona file isn’t preloaded; a parent thread (or the developer) routes a sub-thread to it through the harness’s delegation tool. The selected agent’s frontmatter then governs the new thread’s model, tool set, and instructions. Chapter 15 builds on this seam.
Purpose: Define specialist personas with domain expertise, calibrated judgment, and explicit behavioral boundaries. An agent configuration is the answer to “who should work on this?”, not in terms of a human team member, but in terms of what expertise, priorities, and constraints the task requires.
File format: .agent.md with frontmatter specifying the model, tools, and description.
---
description: "Backend architecture specialist for Python services"
tools: ["changes", "codebase", "editFiles", "runCommands",
"search", "problems", "testFailure"]
model: claude-sonnet-4.5
---
# Python Architect
You are an expert Python architect specializing in CLI tool design
and modular service architecture.
## Design Philosophy
- Speed and simplicity over complexity
- Solid foundation that can be iterated on
- Operations proportional to affected files, not workspace size
## Patterns You Enforce
- BaseIntegrator for all file-level integrators
- CommandLogger for all CLI output
- AuthResolver for all credential access
## You Never
- Add a new base class when an existing one can be extended
- Instantiate AuthResolver per-request (it is a singleton)
- Import from integration/ in the CLI layer (use the public API)

Four elements make an agent configuration effective:
- Domain expertise. Specific enough to constrain the agent’s decisions. “You are an expert Python developer” is too broad to be useful. “You specialize in CLI tool design using the Click framework with Rich terminal output” constrains the solution space meaningfully.
- Named patterns. When the agent knows patterns by name (“BaseIntegrator,” “CredentialChain,” “CommandLogger”) it can reference them in its reasoning and produce code that uses them correctly.
- Anti-patterns. What the agent must never do. These encode institutional memory; each item represents a mistake that happened at least once and cost the team time to fix.
- Tool boundaries. Which tools the agent can invoke. A documentation agent shouldn’t execute destructive commands. A frontend agent shouldn’t access backend databases. Tool boundaries are safety boundaries made concrete.
Start with three to five agent configurations. An architect, a domain expert for your core business logic, and a documentation writer cover most tasks. Add configurations when you observe repeated corrections; that’s the signal that a new specialization has earned its place.
10.2.3 Skills
Load mode: lazy on-demand, description-driven activation. The harness preloads only the skill’s description frontmatter; the body stays unread until the agent itself decides — based on the description matching the current task — to pull the rest into context. The contract is open: the agentskills.io registry standardizes both the SKILL.md entrypoint and the activation predicate, and several harnesses now read substantially compatible skill bundles.1
Purpose: Package reusable decision frameworks that activate based on code patterns. A skill is more than a set of rules; it teaches an agent how to think about a specific domain.
File format: A directory containing a SKILL.md file, optionally with examples and supporting context.
.github/skills/
└── cli-logging-ux/
├── SKILL.md
└── examples/
├── good-warning.py
└── bad-warning.py
---
name: cli-logging-ux
description: >
Activate whenever code touches console helpers, DiagnosticCollector,
STATUS_SYMBOLS, or any user-facing terminal output.
---
# CLI Logging UX
## Decision Framework
### The "So What?" Test
Every warning must answer: what should the user do about this?
A warning without a suggested action is noise.
### The Traffic Light Rule
| Color | Helper | Meaning |
|--------|------------------|--------------------|
| Green | _rich_success() | Completed |
| Yellow | _rich_warning() | User action needed |
| Red | _rich_error() | Cannot continue |
| Blue | _rich_info() | Status update |
### The Newspaper Test
Can a user scan the output like headlines?
If they have to read paragraphs to understand status, restructure.
## Anti-Patterns
- Never use bare print() or click.echo() without styling
- Never emit a warning without an actionable suggestion
- Never mix Rich and colorama in the same output path

The design test for a skill is three criteria: Does this knowledge apply across multiple files? Does it require more than a few rules to express? Is it triggered by a detectable code pattern? If all three are yes, it’s a skill. If the knowledge applies to a single directory, it’s an instruction. If it’s a general disposition, it’s an agent configuration.
Skills differ from instructions in an important way: they provide decision frameworks, not just rules. A rule says “use _rich_warning() for warnings.” A decision framework says “every warning must answer ‘what should the user do about this?’” The framework generalizes to situations the author didn’t anticipate. Rules cover known cases. Frameworks cover unknown ones.
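The "So What?" test from the skill above can be made enforceable in code. This is a minimal sketch with assumed names; the project's real helpers wrap Rich styling as well, which is omitted here.

```python
# Hypothetical helper illustrating the "So What?" decision framework:
# a warning must carry an actionable suggestion, or it is rejected as noise.
# The name `warn` and the plain-text format are assumptions for illustration.
def warn(message: str, suggested_action: str) -> str:
    """Emit a warning only if it tells the user what to do about it."""
    if not suggested_action.strip():
        raise ValueError("warning without a suggested action is noise")
    return f"WARNING: {message} -> {suggested_action}"
```

Making the suggestion a required parameter turns the framework into something the type signature enforces, rather than a rule a reviewer has to remember.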
The design test this section just defined is the same test I had to apply to the design discipline itself before I packaged it; the result is the agent-side companion Part III pointed at.2
10.2.4 Prompts
Load mode: user-invoked. The developer runs the prompt as a workflow — typically through a slash-command surface or a CLI flag — and the harness reads the file as the opening turn. The prompt is parameterized, repeatable, and explicit.
Purpose: Define reusable, parameterized workflows that orchestrate multi-step tasks. A prompt is a repeatable process, the agentic equivalent of a script or a makefile target.
File format: .prompt.md with frontmatter specifying execution mode and tools.
---
mode: agent
description: "Code review workflow with security and architecture checks"
---
# Structured Code Review
## Context Loading
1. Read the [project architecture](../../docs/architecture.md)
2. Review the diff to understand scope of changes
3. Check [security guidelines](../../docs/security.md) for relevant patterns
## Review Phases
### Phase 1: Correctness
- Does the code do what the PR description claims?
- Are edge cases handled?
- Do tests cover the new behavior?
### Phase 2: Architecture
- Does this change respect existing module boundaries?
- Are new dependencies justified and minimal?
- Would a senior engineer on this team approve the approach?
### Phase 3: Security
- Is user input validated at the boundary?
- Are credentials handled through the standard resolver?
- Could this change introduce injection, traversal, or leakage?
## Output Format
Provide findings grouped by severity (Critical / High / Medium / Low).
For each finding: file, line, what's wrong, what to do instead.

Prompts are the bridge between ad-hoc chat and systematic workflows. Without them, every time a developer wants an agent to perform a code review, they type a slightly different request and get slightly different quality. With a prompt file, the process is consistent: the same phases, the same checks, the same output format. Quality becomes reproducible.
10.2.5 Memory
Load mode: eager preload, persistent. The harness reads the memory file (or the equivalent store) at every session start — the file is the only thing that survives the per-thread amnesia identified in Chapter 9. Memory is what makes a session that ends at 6pm useful to a session that starts at 9am tomorrow.
Purpose: Preserve knowledge across sessions. Agents are stateless; every conversation starts from zero. Memory files give them access to accumulated decisions, resolved trade-offs, and project history that would otherwise vanish between sessions.
File format: .memory.md, structured by domain.
# Project Decisions
## Authentication (last updated: 2025-06-15)
- Migrated from session-based to JWT auth in Q1 2025
- Token refresh uses exponential backoff, max 3 retries
- EMU (Enterprise Managed User) tokens use standard PAT prefixes (`github_pat_` / `ghp_`); the `ghu_` prefix is OAuth, not EMU
- The `SessionAuth` class is deprecated but not yet removed.
Do NOT use it for new code. Migration tracked in JIRA-4521.
## API Versioning (last updated: 2025-05-20)
- v1 endpoints frozen. No new features, security fixes only.
- v2 is the active version. All new endpoints go here.
- Versioning is URL-based (/v1/, /v2/), not header-based.
This was debated and decided in ADR-017.
## Performance Decisions (last updated: 2025-07-01)
- Database connection pooling uses pgbouncer, not application-level pools.
- Cache invalidation is event-driven (via message queue), not TTL-based.
- The /api/search endpoint has a 5-second timeout. This is intentional —
longer queries must use the async search endpoint.

Memory files capture the knowledge that doesn’t fit in instructions because it isn’t a rule; it’s context. The distinction matters: “use JWT for authentication” is a rule (instruction). “We migrated from sessions to JWT in Q1, and the old SessionAuth class is still in the code but deprecated” is context (memory). An agent that knows only the rule might accidentally use the deprecated class. An agent that also has the memory won’t.
Memory files are the most likely primitive to drift from reality. Include dates. Review them quarterly. If a section hasn’t been updated in six months, verify that it’s still accurate or remove it.
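The quarterly staleness review can be partially automated. The sketch below assumes the `(last updated: YYYY-MM-DD)` heading convention used in the example above; the six-month threshold and the function name are assumptions.

```python
# Hypothetical staleness check for a memory file. Assumes section headings
# of the form "## Topic (last updated: YYYY-MM-DD)", as in the example above.
import re
from datetime import date, timedelta

STALE_AFTER = timedelta(days=180)  # roughly six months

def stale_sections(memory_text: str, today: date) -> list[str]:
    """Return headings whose 'last updated' date is older than the threshold."""
    pattern = re.compile(
        r"^## (.+?) \(last updated: (\d{4})-(\d{2})-(\d{2})\)", re.M
    )
    stale = []
    for m in pattern.finditer(memory_text):
        updated = date(int(m.group(2)), int(m.group(3)), int(m.group(4)))
        if today - updated > STALE_AFTER:
            stale.append(m.group(1))
    return stale
```

A hook (Section 10.2.7) could run this on a schedule and open an issue listing the sections to verify or remove.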
While this chapter describes memory as markdown files for portability, some tools implement memory as structured databases. GitHub Copilot, for example, stores memory in a database system rather than flat files. The principles are the same (persistence, discoverability, and staleness management) regardless of the storage mechanism.
10.2.6 Orchestration
Load mode: user-invoked workflow. A specification is loaded as the opening context for a planned implementation run — typically by a developer (or a parent thread) handing the file to a fresh session as the brief. The spec is read once, deterministically, and the rest of the run executes against it.
Purpose: Bridge planning and implementation by defining structured specifications that can be executed by humans or agents with the same precision. Orchestration files decompose large features into implementation-ready units.
File format: .spec.md for specifications, or workflow composition files that coordinate multiple context files.
# Feature: Rate Limiting for Public API
## Problem Statement
Public API endpoints have no rate limiting. A single client can
exhaust server resources with sustained high-frequency requests.
## Approach
Implement per-client rate limiting using the internal RateLimiter
with Redis-backed sliding window counters.
## Implementation Requirements
### Components
- [ ] Rate limiter middleware (`src/middleware/rate_limiter.py`)
- [ ] Redis counter service (`src/services/rate_counter.py`)
- [ ] Configuration loader (`src/config/rate_limits.yaml`)
### API Contracts
- 429 response with `Retry-After` header when limit exceeded
- `X-RateLimit-Remaining` header on every response
- Per-endpoint configuration via environment variables
### Validation Criteria
- [ ] Handles concurrent requests without race conditions
- [ ] Sliding window accurate within 1-second granularity
- [ ] Unit tests > 90% coverage on counter logic
- [ ] Load test: 10K requests/second without limiter degradation

Specification files matter because they make the “Reduced Scope” constraint operational. Instead of telling an agent “implement rate limiting” and hoping it figures out the approach, the spec defines scope, components, contracts, and success criteria upfront. The agent implements against a specification, not a wish.
10.2.7 Hooks
Load mode: event-driven, harness-mediated. The hook file is agent source code; what fires it is a client process — a cron daemon, a webhook receiver, a CI runner, a gh-aw workflow, an IDE save handler — that observes an outside event (a file save, a git commit, a pull-request webhook, a scheduled tick) and invokes the harness against the hook’s bootstrap context. Clients can be interactive (terminals, IDEs) or programmatic (workflows, schedulers, webhook receivers); the Client role from Chapter 9’s four-part vocabulary is the layer hooks live behind.
Purpose: Define automated actions triggered by development events. Hooks bridge the gap between passive context (instructions, memory) and active behavior, making the instrumented codebase reactive rather than waiting to be queried.
File format: Configured via tool-specific hook mechanisms (e.g., VS Code tasks, GitHub Actions triggers, copilot hooks configuration) rather than a single portable file format.
Examples of what hooks automate:
- Auto-run linting on save to catch convention violations before commit
- Trigger test generation when a new source file is created
- Auto-update memory files after a successful PR merge
- Run a security-reviewer agent on every change to `src/auth/`
Hooks complete the instrumentation model. Without them, every instrumentation file is passive; it waits for an agent to be invoked and for the right context to load. With hooks, the context layer responds to events: a file save triggers a check, a new file triggers scaffolding, a merged PR triggers a memory update. The instrumented codebase stops being a library of reference material and starts behaving like an active participant in the development workflow.
Start with one or two hooks: a linting check on save and a test prompt on new file creation cover the highest-impact triggers. Add more only when you observe a repeated manual step that a hook could eliminate.
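The event-to-hook dispatch pattern itself is simple enough to sketch. The registry below is hypothetical; real harnesses wire hooks through tool-specific configuration, not a Python decorator, but the shape is the same: a client fires an event, and every registered hook runs.

```python
# Hypothetical hook registry illustrating event-driven dispatch. Event names,
# the decorator, and the example hooks are assumptions for illustration.
from collections import defaultdict
from typing import Callable

_hooks: dict[str, list[Callable[[str], str]]] = defaultdict(list)

def on(event: str):
    """Register a handler for a development event such as 'file_saved'."""
    def register(fn):
        _hooks[event].append(fn)
        return fn
    return register

def fire(event: str, path: str) -> list[str]:
    """Invoke every hook bound to the event and collect their actions."""
    return [hook(path) for hook in _hooks[event]]

@on("file_saved")
def lint_on_save(path: str) -> str:
    return f"lint {path}"

@on("file_saved")
def security_review(path: str) -> str:
    # only auth-layer changes trigger the security-reviewer agent
    return f"security-review {path}" if path.startswith("src/auth/") else "skip"
```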
10.3 Tool support, in brief
The seven types are conceptual categories, not file format specifications; how each maps to a concrete file depends on which harness reads it. Every major harness reads project-level markdown from predictable locations — the disagreement is about where the file lives and what frontmatter it accepts. Instructions and skills port most cleanly across vendors; agent activation and prompt invocation are the least portable surfaces. Appendix A holds the dated cross-harness reference matrix; consult it when you actually need to port a project. The practical division is two-tier: a portable tier of body prose (instructions, memory entries, specs, skill decision frameworks) that moves freely as markdown, and a harness-specific tier of wiring (scoping syntax, agent activation, prompt invocation, hook configuration) that you rewrite on a port. The knowledge is roughly 80% of the value; the wiring is the rest.
10.4 Directory Structure
The seven primitive types organize into a predictable directory structure. The layout below follows GitHub Copilot conventions, the most complete native implementation of all seven types. Other harnesses use different file locations (Appendix A is the dated reference), but the organizational principle holds: centralize context files in a known location, separate by type, keep each type flat.
project/
├── .github/
│ ├── copilot-instructions.md # Global project principles
│ ├── instructions/
│ │ ├── api.instructions.md # applyTo: "src/api/**"
│ │ ├── auth.instructions.md # applyTo: "src/auth/**"
│ │ ├── frontend.instructions.md # applyTo: "src/ui/**/*.tsx"
│ │ ├── testing.instructions.md # applyTo: "**/test/**"
│ │ └── database.instructions.md # applyTo: "src/db/**"
│ ├── agents/
│ │ ├── architect.agent.md # Structure, patterns, trade-offs
│ │ ├── backend-dev.agent.md # API implementation, business logic
│ │ ├── security-reviewer.agent.md # Injection, traversal, credentials
│ │ └── doc-writer.agent.md # Documentation consistency
│ ├── skills/
│ │ ├── cli-logging-ux/
│ │ │ ├── SKILL.md
│ │ │ └── examples/
│ │ ├── error-handling/
│ │ │ └── SKILL.md
│ │ └── api-middleware/
│ │ └── SKILL.md
│ ├── prompts/
│ │ ├── code-review.prompt.md
│ │ ├── feature-impl.prompt.md
│ │ └── bug-investigation.prompt.md
│ └── specs/
│ ├── feature-template.spec.md
│ └── api-endpoint.spec.md
├── .memory.md # Project-level memory
├── AGENTS.md # Root discovery file
├── src/
│ ├── api/
│ │ └── AGENTS.md # API-specific context
│ ├── auth/
│ │ └── AGENTS.md # Auth-specific context
│ └── ...
└── ...
Three observations. Context files live in .github/, centralized so the knowledge layer is findable in one place; the exceptions are AGENTS.md files, which live in the directories they describe because they participate in a discovery hierarchy Chapter 11 covers. Each primitive type has its own directory — instructions, agents, skills, prompts, specs, and hooks don’t mix; this makes auditing what you have straightforward. The structure is flat within each directory. Resist the urge to nest. A flat list of 15 descriptively named files scans faster than a three-level tree, and most projects need 8-12 instruction files, not 50.
10.5 How Primitives Compose
Primitives are not independent. They form a layered system, and the agent’s effective context for a task is the composition of every applicable file:
flowchart TD
G["<b>Global principles</b><br/><span style='font-size:11px'>copilot-instructions.md</span>"]
I["<b>Scoped instructions</b><br/><span style='font-size:11px'>*.instructions.md, matched by applyTo</span>"]
S["<b>Skills</b><br/><span style='font-size:11px'>activated by code patterns</span>"]
A["<b>Agent configuration</b><br/><span style='font-size:11px'>persona, model, tool boundaries</span>"]
P["<b>Prompt or spec</b><br/><span style='font-size:11px'>the specific workflow being executed</span>"]
M["<b>Memory</b><br/><span style='font-size:11px'>accumulated project context</span>"]
H["<b>Hooks</b><br/><span style='font-size:11px'>event-driven triggers, cross-cutting</span>"]
G --> I --> S --> A --> P --> M
H -.->|cross-cutting| G
H -.-> I
H -.-> S
H -.-> A
H -.-> P
H -.-> M
classDef layer fill:#f5f5f5,stroke:#333,stroke-width:1px,color:#000
classDef cross fill:#fff3e0,stroke:#e65100,stroke-width:1px,color:#000
class G,I,S,A,P,M layer
class H cross
When an agent is asked to modify src/api/users.py, the effective context assembles from global principles, the applyTo: "src/api/**" instruction file (frontend instructions stay out), the API middleware skill (activated by route patterns), the backend-dev agent (persona, model, tools), and the memory file (versioning decisions, deprecated SessionAuth, rate limit timeouts). Each layer adds specificity; none contradicts the layer above. A conflict indicates a design error in the instrumentation, not a resolution the agent should attempt.
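The assembly step for the deterministic layers can be sketched with ordinary glob matching. This is a simplification with assumed names: real harnesses add token budgets, skill activation, and agent selection on top of this core.

```python
# Simplified sketch of effective-context assembly for a task path.
# Real harnesses layer in budgets and lazy skill activation; this shows only
# the glob-scoped composition of the deterministic layers.
from fnmatch import fnmatch

def effective_context(path: str, scoped_instructions: dict[str, str],
                      global_rules: str, memory: str) -> list[str]:
    """Layer global rules, every applyTo-matched instruction, and memory."""
    layers = [global_rules]
    for glob, body in scoped_instructions.items():
        if fnmatch(path, glob):  # applyTo glob matched against the task path
            layers.append(body)
    layers.append(memory)
    return layers
```

Note what falls out of the matching: editing `src/api/users.py` pulls in the API instructions and nothing from the frontend scope.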
This composition is what Chapter 11’s Explicit Hierarchy constraint makes concrete. Global rules provide consistency, scoped rules provide domain adaptation, skills provide decision frameworks, agent configurations provide expertise, prompts and specs provide task structure, memory provides history, hooks provide event-driven reactivity. Together, they give the agent the same information a tenured team member would bring to the task — without requiring that information to fit in anyone’s head.3
10.6 The Instrumentation Audit
Before you build any of this, you need to know what your codebase already has and what it’s missing. The instrumentation audit is a systematic inventory, not of your code, but of the knowledge your code depends on.
Step 1: List your conventions. Spend 30 minutes with your team. Write down every convention, pattern, and constraint that a new engineer would need to learn in their first two weeks. Don’t filter. Don’t organize.
You’ll typically get 30-60 items. Examples:
- All error responses use the `APIError` class
- Token refresh has a 3-retry limit with exponential backoff
- The `SessionAuth` class is deprecated; use `JWTAuth`
- Tests use factory functions, never inline object construction
- Frontend components use the project’s design system, never raw HTML elements
- Database migrations are reviewed by the DBA before merge
- The logging module wraps Rich; never call `print()` directly
Step 2: Classify each item. For every convention, mark where it lives today:
| Location | Meaning | Agent visibility |
|---|---|---|
| In code | Expressed in types, naming, structure | Partially visible — if it’s in the context window |
| In docs | Written in a README, wiki, ADR, style guide | Invisible unless explicitly loaded |
| In heads | Known by team members, never written down | Completely invisible |
The “in heads” column is your instrumentation debt. Every item there is a convention an agent will violate because it has no way to know about it.
Step 3: Rank by failure cost.
- Critical: Security vulnerabilities, data corruption, production outages
- High: Architectural violations that accumulate as technical debt
- Medium: Convention violations that require rework in code review
- Low: Style preferences that don’t affect correctness
Step 4: Map each item to a primitive type.
| If the knowledge… | It belongs in… |
|---|---|
| Is a rule scoped to specific files/directories | An instruction file |
| Requires specialist expertise or a specific model | An agent configuration |
| Applies across files and needs a decision framework | A skill |
| Defines a repeatable multi-step process | A prompt |
| Records a decision, trade-off, or historical fact | A memory file |
| Specifies a feature with components and success criteria | A specification |
| Defines an automated response to a development event | A hook |
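The mapping table above can be encoded as a first-match classifier. The trait names below are assumptions made for illustration; the ordering mirrors the table, with memory as the fallback for decisions and historical facts.

```python
# Hypothetical classifier encoding the Step 4 mapping table. Trait keys are
# invented for illustration; the first matching trait wins.
def primitive_for(knowledge: dict[str, bool]) -> str:
    """Map a convention's traits to the primitive type it belongs in."""
    if knowledge.get("event_driven"):
        return "hook"
    if knowledge.get("feature_spec"):
        return "specification"
    if knowledge.get("multi_step_process"):
        return "prompt"
    if knowledge.get("needs_specialist_or_model"):
        return "agent configuration"
    if knowledge.get("cross_file_framework"):
        return "skill"
    if knowledge.get("scoped_rule"):
        return "instruction file"
    return "memory file"  # decisions, trade-offs, historical facts
```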
Step 5: Write your starter set. Begin with 3-5 context files covering your critical items. Don’t aim for completeness; the feedback loop will guide you to what’s actually needed faster than upfront planning will.
10.7 Before and After
Consider a mid-size backend service: 80,000 lines of Python, a REST API, a message queue consumer, an authentication module with some technical debt, and a CLI for operations tasks. The team has five engineers who’ve been working on it for two years.
10.7.1 Before: uninstrumented
project/
├── .github/
│ └── workflows/
│ └── ci.yml
├── README.md
├── src/
│ ├── api/
│ ├── auth/
│ ├── workers/
│ ├── cli/
│ └── models/
├── tests/
├── docs/
│ └── architecture.md
└── pyproject.toml
What happens when an agent is asked to “add a health check endpoint”:
- It creates a new route in the API directory, using Flask patterns it learned from training data, but this project uses FastAPI
- It imports a database connection check, but uses a raw connection instead of the project’s `HealthChecker` service
- It returns a plain JSON response, ignoring the project’s standard response envelope (`{"status": ..., "data": ..., "meta": ...}`)
- It writes a test that passes, using inline object construction, violating the factory pattern the team uses everywhere else
- It doesn’t register the route in the middleware pipeline because it doesn’t know the pipeline exists
Everything compiles. Tests pass. The PR gets three review comments, all of the form “we don’t do it that way here.” The reviewer rewrites 60% of the code. The agent saved time on a first draft and cost time on corrections.
10.7.2 After: instrumented
project/
├── .github/
│ ├── copilot-instructions.md
│ ├── instructions/
│ │ ├── api.instructions.md
│ │ ├── auth.instructions.md
│ │ ├── testing.instructions.md
│ │ └── cli.instructions.md
│ ├── agents/
│ │ ├── backend-dev.agent.md
│ │ └── doc-writer.agent.md
│ ├── skills/
│ │ └── api-middleware/
│ │ └── SKILL.md
│ ├── prompts/
│ │ └── new-endpoint.prompt.md
│ └── workflows/
│ └── ci.yml
├── .memory.md
├── AGENTS.md
├── README.md
├── src/
│ ├── api/
│ │ └── AGENTS.md
│ ├── auth/
│ │ └── AGENTS.md
│ ├── workers/
│ ├── cli/
│ └── models/
├── tests/
├── docs/
│ └── architecture.md
└── pyproject.toml
Same task. The agent now loads:
- Global instructions — error handling patterns, security rules, test requirements
- API instructions — FastAPI conventions, standard response envelope, route registration
- API middleware skill — the registration pipeline, middleware ordering
- Backend-dev agent — knows this is a FastAPI project, knows the service patterns
- New-endpoint prompt — step-by-step: check existing health patterns, register route, write factory-based test, update middleware
What it produces:
- A FastAPI route using the project’s standard response envelope
- A health check that delegates to `HealthChecker`, which already knows how to verify database, cache, and queue connectivity
- Registration in the middleware pipeline via the standard pattern
- A test using the project’s factory functions, following the naming convention, respecting the fixture hierarchy
- No review comments about conventions, because the conventions were in the context
The difference is not the model. In the case documented in this book, the difference is 150 lines of markdown distributed across 8 files.
10.7.3 What the numbers look like
In the author’s experience across projects that have undergone this transformation, including the reference case documented throughout this book:
| Metric | Uninstrumented | Instrumented |
|---|---|---|
| Convention-violating outputs | 40-60% of generated code | Under 10% |
| Review comments per agent PR | 4-8 (“we don’t do it that way”) | 0-2 (substantive, not stylistic) |
| Agent-generated code requiring rewrite | 30-50% | Under 15% |
| Time from agent output to merge | Hours (review + rework) | Minutes (spot-check) |
These numbers are directional, not guaranteed; they depend on the quality of your context files and the complexity of your conventions. The direction is consistent.4

4. Based on the author’s experience across instrumented projects. Results will vary by codebase complexity, model, and instrumentation maturity.
10.8 The Feedback Loop
Instrumentation is not a one-time setup. It is a continuous practice, like testing.
When an agent produces incorrect output, the diagnosis follows a consistent pattern:
Failure observed
|
v
Root cause: which context file failed?
|
+-- Agent too generic? --> Add domain knowledge to agent config
+-- Skill rules incomplete? --> Add the missing case to the skill
+-- Instructions missing scope? --> Add a scoped instruction file
+-- No decision framework? --> Extract a new skill
+-- Context gap? --> Update the memory file
+-- No repeatable process? --> Create a prompt
Four examples from a real project, where fixing the instrumentation file fixed the class of error permanently:
| Failure | Root cause | Context fix |
|---|---|---|
| Agent used `_rich_info()` directly instead of `logger.progress()` | Skill didn't explicitly ban direct calls | Added "never call `_rich_*` directly in commands" to CLI skill |
| Agent invented a new collision detection pattern | Instructions didn't list all base-class methods | Added "use, don't reimplement" table to integrator instructions |
| Agent produced inconsistent Unicode symbols in output | No single source of truth for status symbols | Created `STATUS_SYMBOLS` reference in skill, added to anti-patterns |
| Agent used deprecated `SessionAuth` in new code | Memory file didn't record the deprecation | Added deprecation notice with migration tracking reference |
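As an illustration of the last fix, the memory-file entry might look like this sketch (the wording is hypothetical; only the `SessionAuth` deprecation comes from the table above):

```markdown
## Deprecations
- `SessionAuth` is deprecated; do not use in new code.
  See the migration tracking reference for the replacement pattern.
```

One short entry like this is enough to retire the entire class of "agent used the deprecated API" failures.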
This feedback loop is how context file quality compounds over time. Every failure you diagnose and fix is a failure that never recurs. After 20-30 iterations, your instrumentation set covers the conventions that actually matter: not the ones you theorized about, but the ones agents actually violate. That practical grounding is what makes an instrumented codebase effective.
10.9 What It Looks Like: An Annotated Session
The following excerpts are from a real Copilot CLI session — not a reconstruction, not a sanitized demo. They are lightly trimmed for length but not edited for voice. The typos and urgency are part of the record.
Session d89b3ccc · Repository: microsoft/apm · Branch: feat/auth-logging-architecture-393 · Result: PR #394, 75 files changed · Duration: 207 turns across 2 sessions
Turn 0 — The messy start. The session opens with the developer pasting raw terminal output of an authentication failure and role-casting the agent in a single breath. This is the Architect role from Chapter 8 — not writing a plan document, but framing the problem and the expertise needed to solve it:
“So I try to install a package from a GitHub EMU organization (requires auth) and it goes as:
apm install mcaps-microsoft/mcaps-software-poya#v1.0.0
Validating 1 package(s)…
✗ mcaps-microsoft/mcaps-software-poya#v1.0.0 - not accessible or doesn’t exist
[…] Can you think as a world class UX expert and work with an APM logging mechanism implementation expert on what is going on here and how to fix the experience as a whole”
No YAML plan. No schema. A bug report, a role assignment, and a quality bar — all in natural language.
Turn 3 — Fleet deployment in six words. Once the agent drafted a plan, the developer’s entire orchestration command was:
“Fleet deployed: create an issue to track this Epic and then implement on a new branch”
The agent responded with 17 of 25 planned tasks completed, a GitHub issue created, and a feature branch pushed. Contrast this with the fabricated copilot-cli dispatch --scope commands that populate synthetic demos. Real orchestration is natural language with clear intent, not CLI flags.
Turns 4–5 — The feedback loop catches a silent semantic failure. The agent’s initial plan assumed EMU (Enterprise Managed Users) meant *.ghe.com hosts — a reasonable inference from the documentation, and one no test would flag. The developer caught it:
“Note that in the docs the doc writer seems to think EMU = *.ghe.com. But that’s not the only case. A github.com instance may be an EMU. *.ghe.com is for the Data Resident EMU offering. Hopefully this does not reflect faulty auth code and cases.”
One turn later, the correction sharpened into an instrumentation fix — not just correcting the code, but ordering the context file updated:
“Sorry but I do not think ‘ghu_’ is identifying EMUs. I have a ‘github_pat_’ starting token and it comes from an EMU, it’s fine grained. It’s rather ghp (classic PAT) and github_pat (fine grained pat) prefixes. ‘ghu_’ is for OAuth User Tokens. This is critical knowledge for the auth expert agent.md to be updated.”
That last sentence is this chapter’s entire thesis in twelve words. The developer didn’t just fix the code — they fixed the context file so every future agent inherits the correction. This is the Silent Semantic Failure anti-pattern from Chapter 18 caught and resolved at the primitive layer.
Turn 23 — Multi-agent debugging with real output. After initial fixes, auth still failed. The developer invoked the agent team and pasted verbose terminal output showing the full authentication resolution chain:
“It’s better but auth still does not work. Please work together with the agent team (auth expert, logging expert, doc expert) find the rca”
The agent traced the failure through real debug logs: Auth resolved → Trying unauthenticated → failed → retrying with token → 403 → trying credential fill → failed. Root cause: the x-access-token:{token}@host URL format sends Basic auth, which GitHub rejects with a 403 for fine-grained PATs. The fix was to switch to git -c http.extraHeader='Authorization: Bearer {token}'. Verbose instrumentation made the failure chain legible. Without it, the 403 would have been a mystery.
Turn 50 — Escalation to named agents. When single-agent debugging hit diminishing returns, the developer restructured the working group:
“considering the implementation you’ve already started, and realizing you’ve been hitting roadblocks, I want you now to work in plan mode with our custom .agent.md software architect and the logging expert with the related skill. These are 2 subagents that will work with you on the planning task.”
This is instrumentation at the orchestration layer. The developer didn’t write more code or more tests — they changed which agents were in the room and how they coordinated. The custom .agent.md files (described below) gave each agent persistent domain knowledge that outlives the session.
Turn 71 — Granular review, then delegation. The developer shifted into the Reviewer role (Chapter 8) at its most specific, then handed off to specialists:
“you are printing the hashes twice? […] And if we are in verbose mode, why aren’t you printing live in the logs the 22 files we skipped? […] I’ll let the logging expert and UX expert on world class logs plan and decide. This expert agent needs to look at each of these commands with a team of clone subagents working for him and assess whether this is really killer full verbose logs”
The pattern: review at the level of individual log lines, then delegate the fix to an agent whose .agent.md defines the quality bar. The developer sets the standard; the agents implement it.
Turn 73 — The instrumentation payoff. This is the moment that distinguishes a good session from a lasting one:
“you must ensure you update the logging skill so that such architectural concerns and patterns are enshrined — not as unmovable, but as current art and baseline”
Not a code fix. A primitive fix. The developer updated the logging skill file so that every future agent working on logging — in this session, in next month’s session, by a different developer — inherits the patterns discovered here. One fix, permanent prevention. This is the feedback loop described earlier in this chapter, operating at production scale.
What the agent files look like. The .agent.md files referenced throughout this session are not abstractions. Here is the auth expert, created during the session and refined across turns 4–5:
---
name: auth-expert
description: >-
Expert on GitHub authentication, EMU, GHE, ADO, and APM's
AuthResolver architecture. Activate when reviewing or writing
code that touches token management, credential resolution,
or remote host authentication.
model: claude-opus-4.6
---

# Auth Expert
You are an expert on Git hosting authentication across
GitHub.com, GitHub Enterprise (*.ghe.com, GHES), Azure DevOps,
and generic Git hosts.
## Core Knowledge
- **Token prefixes**: Fine-grained PATs (`github_pat_`),
classic PATs (`ghp_`), OAuth user tokens (`ghu_`)...
- **EMU**: Use standard PAT prefixes. No special prefix —
it's a property of the account, not the token.
- **Host classification**: github.com (public), *.ghe.com
  (no public repos), GHES, ADO

This is what “update the auth expert agent.md” means in practice. The domain correction from Turn 5 — that EMU tokens use standard PAT prefixes, not a special ghu_ prefix — lives here permanently. Every agent that activates this file inherits that knowledge without being told twice.
Midway through the session, the developer paused and recognized the pattern:
“Look at the planning and agent teams orchestration pattern we used, with waves, with task dependencies, with checkpoints that include panel discussions […] with the specific skills and custom agent files used to instantiate the different flavor of specialized subagents […] This should become a handbook for AI Engineers to leverage.”
This book was partly born from the patterns it describes. By Turn 79, the developer applied the same orchestration structure — specialized agents, panel review, escalation gates — to set up a documentation team for this handbook itself. The pattern is fractal: it works for code, for knowledge work, for any domain where quality requires multiple perspectives and persistent context.
Three things to notice.
The developer’s value was domain knowledge, not keystrokes. Across 207 turns and 75 changed files, the developer wrote no application code. Their contributions were the problem frame (Turn 0), the domain corrections (Turns 4–5), the team composition decisions (Turns 50, 71), and the context file fixes (Turns 5, 73). This is the role shift Chapter 8 described: architect, reviewer, escalation handler — operating through instrumentation rather than implementation.
Every correction became a permanent artifact. A developer without instrumentation would have fixed the auth code and moved on. The same EMU confusion would recur next month, with a different agent, on a different task. Instead, the correction went into the auth-expert.agent.md (Turn 5) and the logging skill (Turn 73) — context files that prevent the class of error, not just the instance. This is the feedback loop that makes instrumentation compound over time.
The hardest bugs were semantic, not syntactic. The agent’s assumption that EMU means *.ghe.com was plausible, consistent with documentation, and wrong. No linter catches that. No test catches it unless the test author already knows the correct answer — in which case you don’t need the test. This is the Silent Semantic Failure anti-pattern Chapter 18 catalogues. The safety net was a human with domain expertise reviewing agent output with enough context (verbose logs, Turn 23) to see where the reasoning broke. Agent-augmented development does not eliminate the need for human judgment. It changes where that judgment is applied — and instrumentation determines whether it has lasting effect.
10.10 Starting Points
The seven primitives and the full directory structure represent a mature instrumented codebase. You don’t need to build all of it at once. Start where the impact is highest.
Week one. Three files: global instructions (10-15 lines of non-negotiable principles), one scoped instruction file for your most-edited module, one agent configuration for the task agents perform most often.
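A week-one global instructions file really can be this small. The rules below are placeholders chosen to show the register, not recommendations for any particular project:

```markdown
# Global instructions

- All public functions carry type hints and docstrings.
- Never log secrets or tokens; use the project logger, never print.
- Every bug fix ships with a regression test.
- Raise project exception types, never bare Exception.
```

Ten to fifteen lines of non-negotiables is the target; anything longer belongs in a scoped instruction file.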
Week two. Use these files on real work. When the agent violates a convention not covered by your files, add it. When it does something right that surprised you, check whether your instrumentation contributed. Update the memory file with the decisions and trade-offs you resolved this week.
Week three. Extract the first skill — you’ll know it’s time when you’ve written the same guidance in two different instruction files. Package the shared knowledge as a skill with a decision framework. Create your first prompt file for a task you’ve now asked an agent to do three or more times.
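A first prompt file might look like this sketch; the frontmatter field follows a common convention, and the task and steps are hypothetical:

```markdown
---
description: Add a new API endpoint following project conventions
---

1. Read the scoped instruction file for the API module.
2. Check an existing endpoint for the response envelope pattern.
3. Implement the route and register it in the middleware pipeline.
4. Write a factory-based test following the naming convention.
```

The value is repeatability: a task you have asked for three times in chat becomes a file you invoke once.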
Ongoing. Review context files monthly. Remove rules that never trigger. Tighten rules that trigger but don’t prevent the failure. Add new rules only in response to observed failures. Treat your instrumentation like a test suite: it should grow with the codebase, stay accurate, and never contain dead rules.
The instrumented codebase is not a finished state. It is a practice — an ongoing investment in making machine-readable what your team already knows. The directory structure and primitive types here can be built by hand, and for a first project, doing so builds understanding. For subsequent projects, or for teams standardising across repositories, the mechanical work of scaffolding and sharing benefits from a distribution mechanism — the same pattern npm brought to JavaScript modules. This category of tooling is new. One open-source implementation is APM (Agent Package Manager), built by the author of this book; you can build everything in this chapter without it, but tooling reduces scaffolding from hours to seconds.
Shortcut: scaffolding with a package manager
`apm init` generates `copilot-instructions.md`, one scoped instruction file, and one agent configuration — the Week One starter set from this chapter. `apm install` pulls shared context files from any Git repository into your project.
10.11 What this chapter unlocks
You now have the catalogue. Three chapters follow, each treating one face of the same artifact:
Chapter 11 — The PROSE Specification (Chapter 11). How to write the prose inside these files so it actually steers the model. Each PROSE constraint — Progressive Disclosure, Reduced Scope, Orchestrated Composition, Specification, Explicit Hierarchy — is a normative rule for the body of a primitive. Chapter 11 is the writing chapter; this one was the typing chapter.
Chapter 12 — The Load Lifecycle (Chapter 12). How the harness loads these files. The five load modes named in this chapter — eager preload, lazy on-demand, dispatcher-mediated, user-invoked, event-driven — get their mechanics there: load order, transitive closure, the dispatcher’s binding window, why a correctly-placed skill stays silent when a parent instruction overflows the context budget. Chapter 12 is the chapter you reach for when a primitive does not bind.
Chapter 13 — Attention and Context Economy (Chapter 13). What these files cost once loaded. Loading is deterministic; attention is not. A 40-line instruction file and a 400-line one bind identically and behave nothing alike. Chapter 13 is the physics — why context is a working set rather than a budget, why doubling input length more than halves output quality past a threshold, why progressive disclosure is the price of admission rather than polish.
This chapter showed you what to build. Chapter 11 shows you how to write it. Chapter 12 shows you when it loads. Chapter 13 shows you what it costs.
agentskills.io is the open registry standard for the `SKILL.md` entrypoint with description-driven activation. The same standard has been adopted in substantially compatible form by GitHub Copilot (`.github/skills/<name>/SKILL.md`), Claude Code (`.claude/skills/<name>/SKILL.md`), and several others — skills port across harnesses more cleanly than scope-attached rules because the standard pinned both the file name and the activation contract.↩︎

Genesis’s `skills/genesis/SKILL.md` is explicit about its scope under the “Hard rules” heading: “The output of this skill is DESIGN ARTIFACTS, not finished modules. A separate coding step emits the natural-language modules from the artifacts.” That is the design-test discipline applied to design itself — the activation contract this section teaches, drawn at the boundary I had to set before the skill could ship. Agent-side reference; activation contract verbatim.↩︎

The principle of scoped, hierarchical configuration mirrors established patterns in software engineering — from CSS specificity to Git’s nested `.gitignore` files. The key insight is that agents benefit from the same locality-of-reference that makes these patterns effective for humans.↩︎
Static analysis tools (ESLint, Pylint, Ruff) catch syntactic violations but not semantic ones — an agent can pass all linters while violating domain conventions. The instruction files complement, rather than replace, automated linting. See also the discussion of Silent Semantic Failure in Chapter 18.↩︎