Harness engineering: why the code around the model matters more than the model.
A model on its own can only generate text. It can't read your files, run your tests, or remember what it did an hour ago. The harness is the code around the model that does all of that. This post walks through what goes in a harness, how the pieces fit together, and how three production systems (Claude Code, OpenClaw, and Pydantic AI) actually build one.
The control loop
What does an agent actually do, step by step?
Every agent runs the same loop. The harness reads the current context. The model picks the next action. The harness runs that action and feeds the result back in. We call the three steps Observe, Think, and Act.
Simon Willison's definition is hard to improve on: an agent is a system that runs tools in a loop to achieve a goal. The model owns one step in that loop. The harness owns the other two, plus the rules for when to stop.
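The loop fits in a few lines of Python. This is a minimal sketch, not any framework's implementation: `fake_model`, the `TOOLS` table, and the step budget are hypothetical stand-ins for a real model call and real tools.

```python
from typing import Callable

# Hypothetical tool table. A real harness would expose shell, file, and web tools.
TOOLS: dict[str, Callable[[str], str]] = {
    "echo": lambda arg: arg,
    "upper": lambda arg: arg.upper(),
}


def fake_model(context: list[str]) -> tuple[str, str]:
    """Stand-in for the model call. Returns (action, argument)."""
    if not any(line.startswith("result:") for line in context):
        return ("upper", "hello")
    return ("done", "")


def run_agent(goal: str, model=fake_model, max_steps: int = 10) -> list[str]:
    context = [f"goal: {goal}"]
    for _ in range(max_steps):               # stop rule: owned by the harness
        action, arg = model(context)         # Observe + Think: model reads context, picks an action
        if action == "done":
            break
        result = TOOLS[action](arg)          # Act: the harness runs the tool
        context.append(f"result: {result}")  # feed the result back in
    return context
```

The model only appears on one line. Everything else, including the step budget that ends a runaway loop, belongs to the harness.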
Two kinds of checks decide whether the loop keeps going. Computational checks are deterministic. Tests pass. Types check. Schemas validate. Run these first. They give the same answer every time and cost almost nothing compared with a model call. Inferential checks are semantic. A second model reviews the output. A different agent grades a diff. They are fuzzy, but they catch what tests miss.
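One way to encode that ordering is to run every computational check first and only reach the inferential ones if all of them pass. The sketch below is illustrative: `compiles` is a real deterministic check, while `keyword_reviewer` is a hypothetical placeholder for a call to a second model.

```python
from typing import Callable

Check = Callable[[str], tuple[bool, str]]  # returns (passed, message)


def compiles(source: str) -> tuple[bool, str]:
    """Computational check: does the candidate parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"


def keyword_reviewer(source: str) -> tuple[bool, str]:
    """Placeholder for an inferential check. A real one would ask a second model."""
    if "TODO" in source:
        return False, "reviewer: unfinished TODO left in code"
    return True, ""


def evaluate(source: str, computational: list[Check], inferential: list[Check]) -> tuple[bool, str]:
    # Cheap deterministic checks run first and fail fast.
    for check in computational:
        ok, message = check(source)
        if not ok:
            return False, message
    # Fuzzy, more expensive checks run only on candidates that survived.
    for check in inferential:
        ok, message = check(source)
        if not ok:
            return False, message
    return True, ""
```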
Every framework ships some version of this loop. LangGraph builds it from nodes. Pydantic AI hides it inside Agent.run. Claude Code uses an AsyncGenerator. The shape is the consensus.
The filesystem is the database
Where does the agent's durable state actually live?
The most foundational piece of a harness is also the most boring one. It is a directory the agent can read from and write to. The filesystem gives the agent three things. A workspace for code and data. Offload storage so intermediate work doesn't burn context tokens. A coordination surface where multiple agents and humans share state through files.
Add Git on top and you get versioning for free. The agent can track its own progress. It can roll back. It can branch experiments. This is why Claude Code's memory system isn't a vector database. It is Markdown files under ~/.claude/ that you can ls, cat, grep, and git commit.
Sitting on top of the filesystem is the most load-bearing file in any harness: AGENTS.md (sometimes called CLAUDE.md). The harness loads this file into the system prompt at the start of every session. Treat it as a pre-flight checklist, not a style guide. Every line should come from a real failure. Keep it under sixty lines.
You will know your AGENTS.md is too long when adding a new rule makes the agent worse, not better. The model's attention is finite. A new rule has to steal attention from an existing one.
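A harness can enforce the length budget itself when it loads the file. The following is a sketch under assumptions: `build_system_prompt` and `MAX_LINES` are hypothetical names, and refusing to start is only one possible policy (a warning would also work).

```python
from pathlib import Path

MAX_LINES = 60  # the budget suggested above


def build_system_prompt(workspace: Path, base_prompt: str) -> str:
    """Prepend AGENTS.md to the base prompt; refuse to load an oversized file."""
    rules_file = workspace / "AGENTS.md"
    if not rules_file.exists():
        return base_prompt
    text = rules_file.read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]  # ignore blank lines
    if len(lines) > MAX_LINES:
        raise ValueError(
            f"AGENTS.md has {len(lines)} non-blank lines; the limit is {MAX_LINES}"
        )
    return "\n".join(lines) + "\n\n" + base_prompt
```

Failing at load time turns "keep it short" from advice into a constraint the team has to deal with when the file grows.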
Hooks enforce what prompts only request
What stops the agent from typing rm -rf in the wrong directory?
A hook is a small script the harness runs at a specific moment. Before a tool call. After a file edit. Before a commit. The point of a hook is to separate "the agent was told to do X" from "the system actually enforces X". The model can forget a rule in a long conversation. A hook can't forget.
In July 2025 a Replit agent ran DROP DATABASE in production despite freeze instructions in the prompt. The model didn't ignore the rule on purpose. It forgot after thousands of tokens of other context. A regex hook on pre_bash would have stopped the command before it ever reached a shell. That is the entire difference between a prompt and a hook.
A good design rule for hooks: silent on success, loud on failure. If the typecheck passes, the agent hears nothing. If it fails, the error is injected into the conversation and the agent fixes it. The happy path costs zero tokens.
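In code, "silent on success, loud on failure" comes down to a single branch on the exit status. This is a minimal sketch: `run_hook` and `post_edit` are hypothetical names, and the conversation is modelled as a plain list of strings.

```python
import subprocess
from typing import Optional


def run_hook(command: list[str]) -> Optional[str]:
    """Run a check. Return None on success, or the error text to inject on failure."""
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode == 0:
        return None  # silent: nothing is added to the conversation
    output = (proc.stdout + proc.stderr).strip()
    return f"hook failed (exit {proc.returncode}): {output}"


def post_edit(conversation: list[str], command: list[str]) -> None:
    """Call after every file edit, with the project's typecheck command."""
    message = run_hook(command)
    if message is not None:
        conversation.append(message)  # loud: the agent sees the error on its next turn
```

A passing check leaves the conversation untouched, so it adds no tokens to the context.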
Picture the agent issuing a destructive command. The pre-bash hook reads the command string, matches it against a pattern, and blocks the call before any process starts. Nothing reaches a shell.
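A pre-bash hook of this kind can be very small. The sketch below is an illustration rather than a complete policy: the two patterns in `BLOCKED` are assumptions chosen for the examples in this section, and a regex deny-list is easy to evade, so treat it as one layer among several.

```python
import re

# Illustrative deny-list. The patterns are assumptions, not a complete policy.
BLOCKED = [
    # rm with both -r and -f in the same flag group, in either order
    re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+", re.IGNORECASE),
    # SQL that drops a database or table
    re.compile(r"\bdrop\s+(database|table)\b", re.IGNORECASE),
]


def pre_bash(command: str) -> tuple[bool, str]:
    """Return (allowed, reason). Called before the command reaches a shell."""
    for pattern in BLOCKED:
        if pattern.search(command):
            return False, f"blocked by pre_bash: matches {pattern.pattern!r}"
    return True, ""
```

The harness calls `pre_bash` on every shell command and starts a process only when the first element is `True`. The refusal reason can be injected back into the conversation so the agent knows why the call was rejected.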
Context rot is the silent killer
What happens when the context window fills up?
As the conversation gets longer, the model gets worse. It forgets rules from earlier turns. It hallucinates file paths it never opened. It declares the task done before it actually is. The harness has to manage what stays in context. The model can't do that for itself.
Four strategies cover most of the failure space. Compaction summarizes older turns into a shorter form when the window starts to fill. Tool-output offloading keeps the first and last lines of a noisy log in context and writes the rest to disk. Progressive disclosure loads tools and instructions only when the task actually needs them, instead of dumping them all in at startup. For jobs that span many hours, a full context reset tears the session down and rebuilds it from a short handoff file.
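Tool-output offloading is the easiest of the four to show concretely. In this sketch, `offload_output`, its `keep` parameter, and the placeholder line format are all hypothetical choices.

```python
from pathlib import Path


def offload_output(output: str, log_dir: Path, name: str, keep: int = 5) -> str:
    """Keep the first and last `keep` lines in context; write the full text to disk."""
    lines = output.splitlines()
    if len(lines) <= 2 * keep:
        return output  # short enough to keep whole
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{name}.log"
    path.write_text(output, encoding="utf-8")
    omitted = len(lines) - 2 * keep
    marker = f"[{omitted} lines omitted; full output in {path}]"
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])
```

The marker line matters. It tells the agent the full log exists and where to find it, so it can grep the file if the head and tail are not enough.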
A related problem is context anxiety. As the window fills, agents start wrapping up tasks early. Not because the work is done, but because they sense the conversation is getting long. The Ralph loop is one fix. It intercepts the agent's exit attempt, checks against the original goal, and if the work isn't done, restarts the session with a clean compacted context.
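The control flow of such a loop might look like the sketch below. It is a simplified reading of the pattern, not a reference implementation: `run_session` and `goal_met` are hypothetical callables, and the session's summary string stands in for a compacted context.

```python
from typing import Callable


def ralph_loop(
    goal: str,
    run_session: Callable[[str], str],  # runs one session, returns a short summary on exit
    goal_met: Callable[[str], bool],    # independent check against the original goal
    max_restarts: int = 5,
) -> str:
    handoff = f"Goal: {goal}"
    for _ in range(max_restarts):
        summary = run_session(handoff)  # the session ends when the agent tries to exit
        if goal_met(summary):           # trust the check, not the agent's claim
            return summary
        # Restart with a clean context: only the goal and a compact summary survive.
        handoff = (
            f"Goal: {goal}\n"
            f"Progress so far: {summary}\n"
            "The goal is not met yet. Continue."
        )
    raise RuntimeError(f"goal not met after {max_restarts} sessions")
```

The restart budget is there so that a goal the agent cannot reach ends in an error instead of an endless loop.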
Separation of concerns
Why split the work across multiple agents instead of giving one agent more tools?
Anthropic's engineering team found a pattern that keeps showing up: agents rate their own work too generously. They think they are done when they are not. Self-checking is not enough on its own.
The fix is to separate generation from evaluation. One agent writes the code. A different agent reviews it. It is the same reason a code reviewer is usually not the code's author. Different incentives. Different blind spots. The second agent catches things the first one missed.
Once you have one seam, you can keep going. A planner breaks the goal into steps. One or more generators work in parallel inside isolated Git worktrees. An evaluator grades each one and merges the winner. A human stays in the loop for steps that can't be reversed.
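The generate-then-evaluate seam can be sketched without any agent framework. Everything here is a stand-in: generators are plain callables, threads take the place of isolated Git worktrees, `approve` models the human gate, and the function assumes at least one generator.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def best_candidate(
    task: str,
    generators: list[Callable[[str], str]],
    evaluate: Callable[[str, str], float],  # independent grader: (task, candidate) -> score
    approve: Optional[Callable[[str], bool]] = None,  # human gate for irreversible steps
) -> Optional[str]:
    # Generators run in parallel and never see each other's output.
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda gen: gen(task), generators))
    # The evaluator, not the generator, decides what counts as good.
    scored = [(evaluate(task, candidate), candidate) for candidate in candidates]
    score, winner = max(scored, key=lambda pair: pair[0])
    if score <= 0:
        return None  # nothing passed evaluation
    if approve is not None and not approve(winner):
        return None  # a human vetoed the merge
    return winner
```

Generators never score their own output. The score comes from a function they do not control.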
One thing to watch: handoffs between agents are where bugs hide. Every handoff needs a clear protocol. What does each side own? What data crosses the boundary? Otherwise you end up with the same context rot you split the work to avoid.
The ratchet
How does a harness get better over time without a model upgrade?
When an agent makes a mistake, don't just retry. Treat the mistake as a signal. Add a rule, a hook, or both, so the same mistake can't happen again. We call this the ratchet.
Two rules govern the ratchet. The first: only add a constraint after you've seen a real failure. Don't brainstorm rules. Every line in AGENTS.md should trace back to something that went wrong. An agent that has never failed at X doesn't need a rule about X.
The second: only remove a constraint when a better model has made it redundant. When a new model generation removes the failure mode, the scaffolding for that failure becomes dead code. Cut it. Otherwise the harness keeps growing forever and starts to slow the model down.
Better models don't make harnesses obsolete. They move them. The scaffolding for today's weak spot retires when the model improves. New scaffolding takes its place for the next weak spot. Every part of a harness is a bet on something the model can't yet do on its own.
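Both rules are easier to keep when each constraint carries its own provenance. Here is a sketch of one possible data structure. `Rule`, `Ratchet`, and the `still_fails` callback are hypothetical, and deciding whether a failure still reproduces on a newer model is left to whatever evaluation you already run.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Rule:
    text: str     # the line that goes into AGENTS.md
    failure: str  # the incident that earned the rule


@dataclass
class Ratchet:
    rules: list[Rule] = field(default_factory=list)

    def add(self, text: str, failure: str) -> None:
        """Rule one: no constraint without a real failure behind it."""
        if not failure.strip():
            raise ValueError("every rule must cite a real failure")
        self.rules.append(Rule(text, failure))

    def prune(self, still_fails: Callable[[Rule], bool]) -> list[Rule]:
        """Rule two: drop rules whose failure no longer reproduces. Returns what was cut."""
        kept: list[Rule] = []
        removed: list[Rule] = []
        for rule in self.rules:
            (kept if still_fails(rule) else removed).append(rule)
        self.rules = kept
        return removed

    def render(self) -> str:
        return "\n".join(f"- {rule.text}" for rule in self.rules)
```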
Three real harnesses in the wild
What does harness engineering look like in production code?
The clearest way to understand harness engineering is to read code that already does it. Three production systems take different bets, and each one is worth reading.
Claude Code (Anthropic) is the simplest of the three. A single loop, optimized for terminal responsiveness. The loop yields every token as it arrives so the terminal updates in real time. Memory is just Markdown files. Hooks live in settings.json. The product motto is "give your agent a computer", which pushes you to use shell, files, and the web instead of building bespoke tools.
OpenClaw is the open-source platform. Not a tool, but an operating system for agents. It uses a five-layer pipeline so that each layer handles one kind of failure: authentication, rate limits, model fallback, tool dispatch, and execution. Its plugin registry lets hooks change the system prompt itself. Its Agent Control Protocol (ACP) can orchestrate Claude Code, Codex, and Gemini CLI as sub-runtimes. Memory uses both keyword search (BM25) and vector search, with a background pass that consolidates short-term notes into long-term ones.
Pydantic AI bets that the cheapest evaluator is the type system you already pay for. You declare an output schema as a Pydantic model. The agent can't return data that doesn't match. Native support for the Model Context Protocol (MCP) and the Agent-to-Agent (A2A) protocol means tools and other agents plug in without custom glue. If your team already writes typed Python services, this has the lowest activation cost of the three.
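The idea of a declared schema acting as the evaluator can be shown without the library. The sketch below uses only the standard library and is not Pydantic AI's API: `Invoice`, `parse_invoice`, and `run_typed` are hypothetical names, and in Pydantic AI the schema would be a Pydantic model with the library doing the validation.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Invoice:
    customer: str
    total_cents: int


def parse_invoice(raw: str) -> Invoice:
    """Validate raw model output against the Invoice schema, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or set(data) != {"customer", "total_cents"}:
        raise ValueError("expected exactly the keys customer and total_cents")
    if not isinstance(data["customer"], str):
        raise ValueError("customer must be a string")
    total = data["total_cents"]
    if not isinstance(total, int) or isinstance(total, bool):
        raise ValueError("total_cents must be an integer")
    return Invoice(data["customer"], total)


def run_typed(model: Callable[[str], str], prompt: str, max_retries: int = 2) -> Invoice:
    """Call the model until its output validates, feeding each error back to it."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = model(message)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            message = f"{prompt}\nYour last output was rejected: {exc}"
    raise ValueError("model never produced a valid Invoice")
```

Callers of `run_typed` receive either a valid `Invoice` or an exception. Malformed output never gets past the boundary.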
The top coding agents like Claude Code, Cursor, Codex, Aider, and Cline look more like each other than their underlying models do. The models are different. The harness patterns are converging. That is the industry agreeing on what actually matters.
The 95% trap
Why do agent pipelines fail more than you would expect?
Each step in a multi-step pipeline succeeds some fraction of the time. End-to-end success is the product of those fractions. If every step in a twenty-step pipeline succeeds 95% of the time, the whole pipeline finishes 36% of the time.
The fix is not to ask the model to try harder. The fix is at the harness level. Insert verification gates between steps so one failure can't cascade into the next. Save state at each step so you can retry one step instead of the whole pipeline. Prefer computational checks like tests, type checks, and schema validation over the model's own judgment.
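Gates and per-step checkpoints combine naturally in one runner. The sketch below makes several assumptions: `run_pipeline` is a hypothetical name, each step is a `(name, run, gate)` tuple, and the state is a flat JSON-serializable dict so it can be saved to a checkpoint file.

```python
import json
from pathlib import Path
from typing import Callable

# (name, run, gate): run produces a new state, gate decides whether to accept it.
Step = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def run_pipeline(steps: list[Step], state: dict, checkpoint: Path, max_attempts: int = 3) -> dict:
    done: list[str] = []
    if checkpoint.exists():  # resume from the last accepted step
        saved = json.loads(checkpoint.read_text(encoding="utf-8"))
        state, done = saved["state"], saved["done"]
    for name, run, gate in steps:
        if name in done:
            continue  # already verified in an earlier run
        for _ in range(max_attempts):
            candidate = run(dict(state))  # work on a copy so a bad attempt can't leak
            if gate(candidate):           # verification gate between steps
                state = candidate
                break
        else:
            raise RuntimeError(f"step {name!r} failed its gate {max_attempts} times")
        done.append(name)
        checkpoint.write_text(json.dumps({"state": state, "done": done}), encoding="utf-8")
    return state
```

A failed step is retried on its own, and a crashed run resumes after the last step that passed its gate.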
And shorten the pipeline. Every step you can collapse into a deterministic check, or remove entirely, gives you multiplicative reliability. Three steps at 95% is 86% end-to-end. Twenty steps at 95% is 36%. The shape of the curve does most of the work.
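The arithmetic is easy to check. This snippet assumes every step is independent and has the same success rate, which is the simplification the numbers above rely on. `pipeline_success` is a hypothetical helper name.

```python
def pipeline_success(step_success: float, steps: int) -> float:
    """End-to-end success when every step independently succeeds with the same probability."""
    return step_success ** steps


for n in (3, 10, 20):
    print(f"{n:>2} steps at 95%: {pipeline_success(0.95, n):.0%}")
#  3 steps at 95%: 86%
# 10 steps at 95%: 60%
# 20 steps at 95%: 36%
```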
Harnesses fail in production in other ways too. Tool overload: Vercel reported removing 80% of one agent's tools, and the agent got measurably better. Self-evaluation bias: covered under "Separation of concerns" above. Data quality at ingestion: Gartner predicts that through 2026, organizations will abandon 60% of AI projects that lack AI-ready data. That is a data problem, not a model problem.
Eight things to do, in this order.
- 01 Start with a messy v0.1. You can't iterate on something that doesn't exist. Ship a v0.1 that's embarrassing. Then improve it.
- 02 Write AGENTS.md before any code. Keep it under sixty lines. Conventions, package manager, test framework, files the agent should never touch.
- 03 Curate tools aggressively. Ten focused tools beat fifty overlapping ones. Every tool's description costs tokens on every turn.
- 04 Wire hooks early. post_edit runs typecheck. pre_commit runs tests. pre_bash blocks destructive shells. Silent on success.
- 05 Separate generation from evaluation. Agents rate their own work too generously. Use a second agent, a test suite, or both.
- 06 Add observability on day one. You need to see every tool call, token count, and decision the agent made. You can't fix what you can't see.
- 07 Plan for long horizons. Compaction, context resets, Ralph loops. If your agent can't run for thirty minutes, it can't do real work.
- 08 Ratchet relentlessly. Every failure earns a rule. Every rule traces to a failure. Prune rules when better models make them unnecessary.
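For the observability item, a reasonable first step is to wrap every tool so each call leaves a record. The sketch below is one simple way to do that. `traced` and the JSON Lines record layout are hypothetical, and token counts are left out because they depend on the model API in use.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


def traced(tool_name: str, tool: Callable[..., Any], trace_file: Path) -> Callable[..., Any]:
    """Wrap a tool so every call appends one JSON line to trace_file."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.monotonic()
        record = {"tool": tool_name, "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = tool(*args, **kwargs)
            record["ok"] = True
            return result
        except Exception as exc:
            record["ok"] = False
            record["error"] = repr(exc)
            raise  # tracing must not swallow the failure
        finally:
            record["seconds"] = round(time.monotonic() - start, 4)
            with trace_file.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")

    return wrapper
```

Because the trace is a plain file in the workspace, the same `grep` and `git` habits that work for the agent's memory work for debugging it.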
The gap between what today's models can do and what you actually see them doing is mostly a harness gap. Improving your harness will outperform waiting for the next model release. Every time.