
Introduction to Harness Engineering: Improving AI Agent Quality Through Environment Design
Introduction
Harness engineering is a discipline for improving AI agent output quality and reproducibility through structural environment design. Rather than crafting better prompts or optimizing context windows, it focuses on designing the environment itself — rules, skills, hooks, memory, and feedback loops — to achieve more consistent, high-quality outputs.
Key Takeaways
- Understand one useful framing: the three stages of Prompt → Context → Harness Engineering
- Learn the 5 harness components: rules, skills, hooks, memory, and feedback
- Explore OpenAI's "0 human-written lines, 1M lines generated" experiment
- Get a practical adoption roadmap you can start tomorrow
From late 2025 into 2026, OpenAI, Anthropic, and countless development teams converged on the same conclusion: harness design matters more than model selection for agent output quality. Research on SWE-bench results (e.g., Particula.tech) suggests that pass rates can vary significantly based on scaffolding design alone, even when using the same model.
Why Aren't Prompts and Context Enough?
One useful framing is that AI collaboration has evolved through three stages. Each stage builds upon the last — like constructing a building with foundations (prompts), walls (context), and a roof (harness).
Stage 1: Prompt Engineering (2022–2024)
Focused on "how to ask." Few-shot prompting, Chain-of-Thought, role-playing, and other techniques for eliciting better responses from LLMs. Still effective for individual use and simple tasks, but insufficient at scale.
Stage 2: Context Engineering (2025–)
Focused on "what to show." Andrej Karpathy described it in June 2025:
Context engineering is the delicate art and science of filling the context window with just the right information for the next step.
This encompasses task descriptions, few-shot examples, RAG, tool definitions, and state management. CLAUDE.md and AGENTS.md are prime examples of context engineering artifacts.
Stage 3: Harness Engineering (Late 2025–)
Focused on "what environment to create." Named after the full set of horse tack ("harness"), it refers to the totality of rules, constraints, checks, tests, tools, memory, and safety mechanisms surrounding an AI agent. The term "harness engineering" is not attributed to any single company or individual — it emerged organically across multiple communities in late 2025. Martin Fowler notes "the term has emerged," and Mitchell Hashimoto wrote "I'm not sure if this is a widely used term."
The critical difference from context engineering is enforcement. Writing "please run tests before committing" in CLAUDE.md is a request. In harness engineering, a hook automatically runs tests before every commit and blocks the operation on failure.
What Are the 5 Components of a Harness?
Martin Fowler categorizes harness controls into feedforward (guides that steer before action) and feedback (sensors that correct after action). Within this framework, a harness consists of five elements:
| Component | Function | Control Type | Example |
|---|---|---|---|
| Rules | Behavioral guidelines | Feedforward | src/domain/ must not import external libraries |
| Skills | Reusable procedures | Feedforward | /write-test for standardized test generation |
| Hooks | Event-driven triggers | Feedback | Auto-format and type-check on file save |
| Memory | Cross-session persistence | Feedforward | progress.md tracking design decisions |
| Feedback | Multi-layered verification | Feedback | Type check → lint → test → structural test |
Rules: Codify Behavioral Standards
Written in CLAUDE.md or .claude/rules/. According to HumanLayer's blog, CLAUDE.md should be kept under 60 lines, focusing on project-specific rules. Generic best practices that models already know provide little additional value.
Skills: Make Procedures Reusable
When you catch yourself giving the same instructions repeatedly, it's time to extract a skill. In Claude Code, you can define custom slash commands in .claude/commands/*.md or skills in .claude/skills/<name>/SKILL.md, both invokable via /command-name. Test generation, code review, article creation — any standardized procedure can be shared across a team.
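As a concrete illustration, a test-generation skill might look like the fragment below. The directory layout follows the article; the frontmatter fields and procedure text are assumptions based on common usage, not an official template:

```markdown
---
name: write-test
description: Generate a unit test for the selected function, following project conventions
---

When asked to write a test:
1. Read the target function and its existing tests to match style.
2. Cover the happy path, one edge case, and one failure case.
3. Run the test suite and report the result before finishing.
```

Once checked into .claude/skills/write-test/SKILL.md, any team member's agent can invoke the same procedure, which is what turns a habit into a shared asset.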
Hooks: Enforce Quality Through Events
Hooks are scripts triggered by agent lifecycle events. In Claude Code, hooks can be attached to events such as PreToolUse (before tool execution), PostToolUse (after tool execution), and Stop (when the response completes). Their enforcement power is the key distinction from rules — they don't ask, they enforce.
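To make the enforcement idea concrete, here is a minimal sketch of the decision logic a PreToolUse hook script might use to block commits when tests fail. The event field names mirror Claude Code's hook JSON but should be treated as assumptions, and the test command is just an example:

```python
import subprocess

def should_block(event: dict,
                 run_tests=lambda: subprocess.run(["pytest", "-q"]).returncode) -> bool:
    """Decide whether a PreToolUse hook should veto this tool call.

    Sketch only: the event shape (tool_input.command) follows Claude Code's
    hook JSON, but treat the field names as assumptions. The policy here is
    "block any `git commit` while the test suite is failing."
    """
    command = event.get("tool_input", {}).get("command", "")
    if "git commit" not in command:
        return False  # unrelated tool call: always allow
    return run_tests() != 0  # tests failed: veto the commit
```

A thin wrapper script would read the event JSON from stdin and exit non-zero when should_block returns True, which the agent runtime interprets as a blocked operation rather than a polite suggestion.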
Memory: Persist Context Across Sessions
Session amnesia is one of AI agents' greatest weaknesses. When you start a new session, the agent typically only reads CLAUDE.md automatically. Previous progress, design decisions, and context vanish unless you take explicit steps.
Three main approaches exist today:
1. Instruct the agent in CLAUDE.md to read progress files. The simplest approach: add "Before starting work, read progress.md. After completing work, update it." to CLAUDE.md. Simple, but still a request the agent may not always follow.
2. Auto Memory (Claude Code). Claude Code automatically saves memories to ~/.claude/projects/<project>/memory/. The MEMORY.md index (first 200 lines) is loaded at every session start, similar to CLAUDE.md. However, what gets saved is up to the agent's judgment.
3. Force-feed via harness orchestration (Anthropic's research pattern). In Anthropic's research, an external harness script automatically loads claude-progress.txt and JSON feature lists into the agent's context at startup. JSON is preferred because agents are less likely to carelessly overwrite structured data. This requires an external orchestration layer beyond Claude Code itself.
Dealing with context compaction: In Claude Code, long conversations trigger automatic context compression. To prevent the agent from losing track after compaction, write important decisions and progress to external files like progress.md as you go. Files aren't compressed — they can always be re-read.
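The force-feed pattern from approach 3 can be sketched as a small harness function that gathers the progress files into a context string before the session starts. The file names follow the article; the section headers and prompt framing are illustrative:

```python
from pathlib import Path

def build_startup_context(project_dir: str) -> str:
    """Assemble the context a harness injects into the agent at session start.

    Sketch of the force-feed pattern: claude-progress.txt and
    feature_list.json are the file names from Anthropic's research; the
    markdown framing around them is our own assumption.
    """
    root = Path(project_dir)
    parts = []
    progress = root / "claude-progress.txt"
    if progress.exists():
        parts.append("## Prior progress\n" + progress.read_text())
    features = root / "feature_list.json"
    if features.exists():
        # JSON is kept structured: agents are less likely to clobber it
        parts.append("## Feature list (JSON)\n" + features.read_text())
    return "\n\n".join(parts)
```

The orchestration layer would prepend this string to the first user message, so the agent never has to remember to read the files itself.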
Feedback: Multi-Layered Automated Verification
Feedback comes in two forms: computational controls (linters, tests — deterministic, fast) and inferential controls (LLM-as-Judge — slower, non-deterministic). Prioritize computational controls and reserve inferential controls for the final stage.
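A fail-fast verification ladder along these lines can be sketched as follows. The specific commands (mypy, ruff, pytest) are illustrative stand-ins for whatever toolchain a project uses; an inferential LLM-judge stage, not shown, would only run after every computational check passes:

```python
import subprocess

# Ordered cheapest-first: each stage only runs if the previous one passed.
# Command names are examples, not a prescribed toolchain.
CHECKS = [
    ("type check", ["mypy", "src"]),
    ("lint", ["ruff", "check", "src"]),
    ("tests", ["pytest", "-q"]),
]

def run_checks(checks=CHECKS, runner=subprocess.run):
    """Run each check in order; return (True, None) on success,
    or (False, name_of_failed_stage) on the first failure."""
    for name, cmd in checks:
        if runner(cmd).returncode != 0:
            return False, name  # fail fast: later stages never run
    return True, None
```

Injecting the runner makes the ladder trivial to unit-test, which matters because the harness itself is code that can rot.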
What Can We Learn from OpenAI's "0 Human Lines, 1M Generated" Experiment?
From August 2025 to January 2026, OpenAI conducted a groundbreaking internal experiment led by Ryan Lopopolo. The results were remarkable:
- Human-written code: 0 lines
- Generated code: ~1 million lines
- Merged PRs: ~1,500
- Per engineer: 3.5 PRs/day
- Token consumption: 1B+ tokens/day (according to Latent Space estimates, ~$2,000–3,000/day)
"Manually Fixing Quality" Doesn't Scale
OpenAI first tried "AI Slop Friday" — dedicating 20% of each Friday to manually fixing low-quality code. As the codebase grew, the team couldn't keep up.
The solution: encode quality standards as automated check rules and let another AI agent handle fixes. Instead of humans fixing quality, the system maintains quality — the core idea of harness engineering.
Here are some of the rules OpenAI embedded in their harness, in plain terms:
- Don't reinvent the wheel: Common logic goes into shared libraries, preventing agents from creating duplicates
- Validate data at the entry point: External data is checked at system boundaries so internal code can trust it
- Enforce one-way dependencies: "UI can call Service, but not the other way around" — enforced by automated checks, not instructions
Each pattern replaces "asking the AI every time" with "enforcing it structurally."
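The dependency-direction rule is the easiest of the three to enforce structurally. A minimal sketch, assuming a layout where service/ must never import from ui/ (the directory names and regex are illustrative, not from OpenAI's actual harness):

```python
import re
from pathlib import Path

def find_violations(root: str, forbidden_prefix: str = "ui") -> list:
    """Return service-layer files that import from the UI layer.

    Hypothetical layout: ui/ may import service/, never the reverse.
    A hook or CI step failing on a non-empty result turns the rule
    from an instruction into an enforced invariant.
    """
    pattern = re.compile(rf"^\s*(from|import)\s+{forbidden_prefix}\b",
                         re.MULTILINE)
    violations = []
    for path in Path(root, "service").rglob("*.py"):
        if pattern.search(path.read_text()):
            violations.append(str(path))
    return violations
```

Wired into a pre-commit hook or CI gate, this check rejects the violation at the moment it is introduced, with no human or agent judgment involved.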
Keep Instructions Short, Details Separate
Another key finding: the master instruction file (equivalent to CLAUDE.md) was kept to ~100 lines as a table of contents, with details in separate files. Agents can't use information they haven't loaded, so easy access matters. But cramming everything into one file causes information overload and hurts performance.
How Does Anthropic's Harness Design Differ?
Anthropic published two engineering blog posts presenting distinct harness architectures.
Pattern A: Two-Agent Division of Labor (November 2025)
When handing off work between team members, leaving behind notes and checklists helps the next person get up to speed quickly. This pattern applies that idea to AI agents, splitting work between two:
Initializer Agent (runs once at project start):
- Creates a dev server startup script (init.sh)
- Creates a work log (claude-progress.txt)
- Creates a feature requirements list (feature_list.json) — a JSON file with 200+ items describing needed features, each with verification steps and a completion flag

Coding Agent (runs at every subsequent session):
- Reads claude-progress.txt to understand prior progress
- Picks the highest-priority incomplete feature from feature_list.json
- Implements → verifies → commits → updates the progress and feature files
The key point is that the two agents don't communicate directly — they hand off through files. The initializer leaves behind "work instructions" that the coding agent reads at the start of every session.
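The coding agent's first step, picking the next feature, can be sketched as a tiny harness helper. The field names (priority, done) are assumptions; the article only says each entry carries verification steps and a completion flag:

```python
import json
from pathlib import Path

def next_feature(feature_list_path: str):
    """Pick the highest-priority incomplete feature from feature_list.json.

    Sketch under assumed field names: each entry has a "done" flag and a
    numeric "priority" (lower = more urgent). Returns None when everything
    is complete, which signals the session loop to stop.
    """
    features = json.loads(Path(feature_list_path).read_text())
    todo = [f for f in features if not f.get("done", False)]
    if not todo:
        return None
    return min(todo, key=lambda f: f.get("priority", 999))
```

Because the selection logic lives in the harness rather than in the prompt, every session resumes deterministically from the same file state.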
A notable characteristic of this pattern is the absence of detailed specs. The feature list's "description + verification steps" serves as the de facto specification. Anthropic's experiment involved building a Claude.ai clone, so the model already knew "how a chat UI should work," and development could proceed from lightweight specs alone. For unfamiliar business domains where the model lacks existing knowledge, this approach alone would likely be insufficient.
Testing revealed another problem: the coding agent would mark features as "complete" on the checklist without actually verifying the full application worked end-to-end. In other words, "built it, but never confirmed it actually works." The twin issues of "specs too lightweight" and "self-reported verification" led directly to the next pattern.
Pattern B: Three-Agent GAN-Inspired Architecture (March 2026)
To address Pattern A's issues — insufficient specs and weak verification — this design separates planning, development, and evaluation into three independent agents. Like a GAN (Generative Adversarial Network), it pits a "creator" against an "evaluator" to ensure quality:
- Planner: From a brief prompt, generates the detailed specifications that Pattern A lacked, clarifying exactly what needs to be built
- Generator: Builds based on Planner's specs. Performs self-evaluation, but defers final judgment to the Evaluator
- Evaluator: Tests as a real user by operating the app in a browser via Playwright MCP. As a separate agent from Generator, it avoids the "being lenient on your own code" problem
These three agents aren't managed by a human switching sessions manually. An orchestrator built with the Claude Agent SDK automatically runs the Planner → Generator ⇄ Evaluator loop. The human provides the initial prompt, and the system then runs autonomously for hours.
The Claude Agent SDK packages the Claude Code CLI's internals (agent loop + tool execution) as a Python/TypeScript library. Unlike interactive CLI usage, you launch Claude agents from your own code via a query() function and orchestrate multiple agents programmatically. Note that similar multi-agent architectures can also be built without the SDK, using Claude Code's skill system (an orchestrator skill calling worker skills).
Inter-agent communication is file-based, like Pattern A: one agent writes a file, another reads it and responds. A distinctive mechanism is "sprint contracts" — before each sprint, Generator and Evaluator agree on what "done" looks like. Generator proposes an implementation plan, Evaluator reviews and negotiates until they reach agreement. This prevents the "creator and evaluator have different standards" problem.
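Stripped to its control flow, the orchestrator is a small loop. In the sketch below the three agents are stub callables; a real implementation would back each one with the Claude Agent SDK's query(), and the feedback handling here simplifies sprint contracts down to a pass/fail verdict:

```python
# Minimal sketch of the Planner -> Generator <-> Evaluator loop.
# The agent functions are injected stubs, not real SDK calls.
def orchestrate(prompt, plan, generate, evaluate, max_rounds=5):
    spec = plan(prompt)                    # Planner: brief prompt -> detailed spec
    artifact = None
    for _ in range(max_rounds):
        artifact = generate(spec, artifact)    # Generator builds or revises
        passed, feedback = evaluate(artifact)  # Evaluator tests as a real user
        if passed:
            return artifact                    # Evaluator signs off: done
        # Feed the Evaluator's findings back into the next Generator round
        spec = spec + "\nEvaluator feedback: " + feedback
    return artifact  # best effort after max_rounds
```

The human appears only at the top, supplying the prompt; everything inside the loop runs unattended, which is what lets a session stretch to hours.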
Anthropic's research team highlighted a key principle: "Every component in a harness encodes an assumption about what the model can't do" — meaning harness complexity should decrease as model capabilities improve.
Concretely, in the Game Maker project, a simple harness produced broken output (20 min, $9), while a fully designed harness yielded a functional, polished application (6 hrs, $200).
Practical Guide: Starting Harness Engineering Tomorrow
Mitchell Hashimoto (HashiCorp co-founder) offers a clear rule: "Anytime you find an agent makes a mistake, take the time to engineer a solution such that the agent never makes that mistake again." But pre-optimizing before real failures emerge is counterproductive.
Phased Adoption
Day 1: Create a CLAUDE.md in 30 minutes. Focus on project-specific rules, keep it concise.
Weeks 1–2: Extract skills from instructions you find yourself repeating. /write-test, /review, etc.
Weeks 2–4: Set up hooks for auto-formatting, type-checking, and test execution. Introduce enforcement.
Weeks 2–4: Add "read progress.md before work, update it after" to CLAUDE.md. Use Auto Memory to persist context across sessions.
Ongoing: When the same violation occurs 3 times, escalate the rule's enforcement level (rule description → hook → test).
Caveats
ETH Zurich research found that LLM-generated configuration files actually degraded performance while consuming 20%+ additional tokens. Chroma's research confirmed that model performance tends to decline at longer context lengths. More harness is not always better — aim for necessary and sufficient.
FAQ
Q. What is the difference between harness engineering and context engineering?
A. Context engineering optimizes "what to show" — CLAUDE.md and RAG are typical examples. Harness engineering encompasses context engineering while adding hooks (enforcement), memory (persistence), and feedback (automated verification) — designing the entire environment.
Q. Is harness engineering necessary for small projects?
A. CLAUDE.md (rules) provides value at any scale. Hooks and skills should be introduced when the same problems recur. "Build the mechanism after the failure" is the correct sequence.
Q. What does harness engineering cost?
A. OpenAI's experiment consumed 1B+ tokens/day (Latent Space estimates ~$2,000–3,000/day). Anthropic's Game Maker cost $200/project with a full harness. Costs increase, but it's a trade-off for quality and reproducibility.
Q. Will harness engineering become obsolete as models improve?
A. Partially, yes. As Anthropic notes, each harness component encodes an assumption about model limitations. Harnesses should be simplified as models improve. However, quality enforcement (tests, lints) and memory (cross-session context) will remain necessary for the foreseeable future.
Summary
Harness engineering is a paradigm for improving AI agent output quality and reproducibility through structure rather than requests. Research on SWE-bench results showing significant pass rate variation from scaffolding design alone suggests that harness design can have as much or more impact than model selection.
The key insight is not to design the perfect harness from day one. Observe failures, then build mechanisms to structurally prevent each one — that iterative process is the essence of harness engineering.
At ZenChAIne, we work on designing and optimizing development workflows powered by AI agents. If you're interested in adopting harness engineering, we'd love to hear from you.
References
- OpenAI — Harness engineering: leveraging Codex in an agent-first world
- Martin Fowler — Harness engineering for coding agent users
- Anthropic — Effective harnesses for long-running agents
- Anthropic — Harness design for long-running application development
- Latent Space — Ryan Lopopolo (OpenAI) deep dive
- HumanLayer — Skill Issue: Harness Engineering for Coding Agents
- Particula.tech — Agent Scaffolding Beats Model Upgrades