Demystifying Agentic Coding Harnesses
A coding harness is the program around a language model that turns a stateless text-to-text function into something that can edit files, run commands, and pick up where you left off an hour ago. Strip the branding from Claude Code, Codex CLI, Aider, or Cursor and what’s left is the same small loop, a list of tools, a curated context window, and a permission layer between the model and your filesystem. This post walks through that machinery from the inside.
1. The LLM is a function, not an agent
A language model, by itself, is a stateless function from text to text. You hand it a sequence of tokens, it gives you back a probability distribution over the next token. You sample one, append it, ask again, and that is the entire primitive. There is no filesystem inside it, no clock, no network, and nothing carried between calls.
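The sample, append, repeat primitive fits in a few lines. In this sketch, next_token_distribution is a hypothetical stand-in for a real model (a hard-coded lookup table, not an API call); only the shape of the loop is the point.

```python
import random

# Hypothetical stand-in for a model: maps the text so far to a
# probability distribution over the next token. A real model computes
# this with a neural network; this toy looks only at the last token.
TOY_MODEL = {
    "<start>": {"the": 1.0},
    "the": {"loop": 0.5, "harness": 0.5},
    "loop": {"<end>": 1.0},
    "harness": {"<end>": 1.0},
}

def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    return TOY_MODEL[tokens[-1]]

def generate(prompt: list[str], rng: random.Random, max_tokens: int = 10) -> list[str]:
    tokens = list(prompt)  # The only "state" is the token list itself.
    for _ in range(max_tokens):
        dist = next_token_distribution(tokens)
        choices, weights = zip(*dist.items())
        token = rng.choices(choices, weights=weights, k=1)[0]  # Sample one.
        if token == "<end>":
            break
        tokens.append(token)  # Append it and ask again.
    return tokens
```

Everything generate knows lives in the tokens list it was handed; nothing survives between two calls.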
The “chat” interface most people use is a thin wrapper around that loop. Your conversation gets serialized into one long string, the model produces the assistant’s reply, and the wrapper hands it back. The API server does not remember anything between requests. Whatever the model appears to “know” about your conversation is bytes you uploaded with this call.
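A rough sketch of that serialization step follows. The role markers are invented for illustration; real chat templates differ by model and are applied on the provider's side.

```python
def serialize(messages: list[dict[str, str]]) -> str:
    """Flatten a conversation into the single string a model would complete."""
    parts = [f"<{m['role']}>\n{m['content']}\n" for m in messages]
    parts.append("<assistant>\n")  # Cue the model to write the next reply.
    return "".join(parts)

history = [
    {"role": "user", "content": "What does a harness do?"},
    {"role": "assistant", "content": "It runs tools for the model."},
    {"role": "user", "content": "Which tools?"},
]
prompt = serialize(history)
```

Every call re-sends the whole history. Removing a message from the list is the only way the model "forgets" it.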
Nothing about that primitive can read your files, run a command, look at your git status, or notice the wall clock. So when a coding assistant edits your codebase, runs your tests, and remembers what you were debugging an hour ago, none of that is the model. It is the program around the model. That program is the harness.
2. The agentic loop
An agent is what you get when you put the LLM in a loop and let it call functions inside that loop. Strip the branding from Claude Code, Codex CLI, Aider’s agent mode, or Cursor’s composer and what’s left is the same shape.
The mechanics, end to end:
- You declare a list of tools to the API. Each one is a name, a description, and a JSON schema describing its inputs.
- You call the model with the conversation so far, plus those tool declarations.
- The response can include tool_use blocks: structured outputs that say “call read_file with path='/src/main.ts'”.
- Your code parses the tool_use blocks, runs the corresponding functions on the host machine, and captures their output.
- You append a tool_result block to the conversation for each call, and call the model again.
- Repeat until the model returns a response with no tool uses (stop_reason == "end_turn").
About thirty lines of Python is enough for a working agent:

    from anthropic import Anthropic

    client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment.

    # The schema half of each tool: this is all the model ever sees.
    TOOLS = [
        {
            "name": "read_file",
            "description": "Read a file from disk.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    ]

    # The function half: plain Python on the host, keyed by tool name.
    HANDLERS = {
        "read_file": lambda path: open(path).read(),
    }

    def run(user_prompt: str) -> str:
        messages = [{"role": "user", "content": user_prompt}]
        while True:
            response = client.messages.create(
                model="claude-opus-4-7",
                max_tokens=4096,
                system="You are a coding agent. Use the tools to help the user.",
                tools=TOOLS,
                messages=messages,
            )
            messages.append({"role": "assistant", "content": response.content})
            # Stop on anything that is not a tool request (end_turn, max_tokens, ...),
            # so an empty tool_result list is never sent back.
            if response.stop_reason != "tool_use":
                return "".join(b.text for b in response.content if b.type == "text")
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    output = HANDLERS[block.name](**block.input)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(output),
                    })
            messages.append({"role": "user", "content": results})

That loop is the agent. Every coding harness in the wild is a variation on it. Add more tools, a longer system prompt, a permissions check before each HANDLERS[...] call, prettier streaming output, and you have a recognizable product. The substance is the loop; the rest is engineering on top of it.
3. Tools, mechanically
A tool is two unrelated objects glued together by a name. One half is a JSON schema sent to the model. The other half is a function on your machine. The model never sees the function. It sees the schema, the description, and the tool’s name, and uses those alone to decide when to call.
The declaration is plain JSON. Here is one in Anthropic’s tool-use format:

    {
      "name": "read_file",
      "description": "Reads a file from the local filesystem. Use this whenever you need the contents of a file before editing it or referencing specific lines.",
      "input_schema": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "Absolute path to the file. Must not be a directory."
          }
        },
        "required": ["path"]
      }
    }

The description does more work than it looks. In a production harness it can run to several hundred tokens (sometimes several thousand) of natural-language documentation that the model reads on every call. It shapes whether the model picks this tool, when it picks it, and how it fills in arguments. When a description says “use Read instead of cat”, that does not mean the model was trained to prefer Read. The preference is a runtime instruction the harness author has pinned to the tool, and a different harness with the same model would behave differently.
The handler is just code. Here is a sketch of Read:

    def read_handler(path: str, offset: int = 0, limit: int = 2000) -> str:
        with open(path, "r") as f:
            lines = f.readlines()
        # offset is 0-based; the numbers shown to the model are 1-based.
        selected = lines[offset : offset + limit]
        return "".join(
            f"{offset + i + 1}\t{line}" for i, line in enumerate(selected)
        )

The leading <line-number>\t prefix on each line is not a quirk of the API; it is a deliberate choice. When the model later wants to edit a file, line numbers make it easier to locate the right region and report the change. The handler is shaping the model’s behavior by shaping its inputs.
Bash is the simplest in spirit and the most consequential in practice:

    import subprocess

    def bash_handler(command: str, timeout_ms: int = 120_000) -> dict:
        proc = subprocess.run(
            command,
            shell=True,  # The string is handed to the shell exactly as the model wrote it.
            capture_output=True,
            text=True,
            timeout=timeout_ms / 1000,  # Raises subprocess.TimeoutExpired when exceeded.
        )
        out = proc.stdout + proc.stderr
        # Cap the output so one noisy command cannot flood the context window.
        if len(out) > 30_000:
            out = out[:30_000] + "\n[output truncated]"
        return {"content": out, "exit_code": proc.returncode}

It spawns a shell. That is all. Whatever the model puts in command is what runs. The harness is the layer that decides whether to actually call this function.
Edit is interesting because the handler enforces a contract the model could otherwise violate:

    class ToolError(Exception):
        """Raised by a handler so the harness can report the failure to the model."""

    # Files the model has seen via Read this session.
    # The Read handler is assumed to add each path it serves to this set.
    _read_files: set[str] = set()

    def edit_handler(path: str, old_string: str, new_string: str) -> str:
        if path not in _read_files:
            raise ToolError("file must be read before it can be edited")
        with open(path) as f:
            content = f.read()
        occurrences = content.count(old_string)
        if occurrences == 0:
            raise ToolError("old_string not found in file")
        if occurrences > 1:
            raise ToolError("old_string is not unique; expand the context")
        with open(path, "w") as f:
            f.write(content.replace(old_string, new_string))
        return f"edited {path}"

The “you have to Read before you Edit” rule is not policed by the model. It is enforced in the handler. The harness is encoding an invariant in code that would otherwise depend on the model’s good judgment, which is the wrong layer to trust for hard constraints.
So when people talk about a model “having tools”: the harness has tools, the harness exposes their schemas to the model, and the model produces structured outputs that the harness interprets as calls. The tools belong to the host program. The model just describes them well enough to drive them.
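One way to picture the glue is a dispatch function that looks a handler up by name and always hands back a tool_result, even on failure. This is a sketch: ToolError and dispatch are illustrative names, and the tool_use block is represented as a plain dict, not an SDK object. The is_error flag is part of Anthropic's tool_result format.

```python
from typing import Any, Callable

class ToolError(Exception):
    """Illustrative exception a handler raises for an expected failure."""

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

# Name -> host function. The model only ever sees the matching schema.
HANDLERS: dict[str, Callable[..., Any]] = {"read_file": read_file}

def dispatch(block: dict[str, Any]) -> dict[str, Any]:
    """Turn one tool_use block (as a dict) into one tool_result block."""
    result = {"type": "tool_result", "tool_use_id": block["id"]}
    handler = HANDLERS.get(block["name"])
    if handler is None:
        return {**result, "content": f"unknown tool: {block['name']}", "is_error": True}
    try:
        return {**result, "content": str(handler(**block["input"]))}
    except (ToolError, OSError, TypeError) as exc:
        # Report the failure to the model instead of crashing the loop.
        return {**result, "content": f"{type(exc).__name__}: {exc}", "is_error": True}
```

Returning the error as text gives the model a chance to correct its arguments on the next turn, which a raised exception in the loop would not.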
4. The context window is all there is
Everything the model knows during a session is one flat list of messages. No hidden state, no semantic search over your codebase, and no memory module running quietly in the background — just the conversation, re-uploaded on every turn.
Roughly what that list looks like a few turns into a Claude Code session:

    system: |
      You are Claude Code, Anthropic's official CLI for Claude.
      <project tree, OS, cwd, git status>
      <tool usage policies>
      <CLAUDE.md contents pasted in>
    messages:
      - role: user
        content: "add a function to parse the config"
      - role: assistant
        content:
          - { type: text, text: "I'll start by reading the existing parser." }
          - { type: tool_use, id: "t1", name: "Read", input: { path: "/src/parser.ts" } }
      - role: user
        content:
          - { type: tool_result, tool_use_id: "t1", content: "1\timport ..." }
      - role: assistant
        content:
          - { type: text, text: "Found the entrypoint. Editing now." }
          - { type: tool_use, id: "t2", name: "Edit", input: { ... } }
      - role: user
        content:
          - { type: tool_result, tool_use_id: "t2", content: "edited /src/parser.ts" }

That is the entire memory of the session. When the next turn fires, the whole structure gets uploaded again as the input to the model.
Most things that feel like state in a coding harness are just text in this list:
- CLAUDE.md and [[3-Topics/AI & ML/A Field Guide to Repository Context Files/Index|project memory files]]. Read once at session start, pasted into the system prompt. The model does not “consult” the file. It already has it, sitting in front of every call.
- Todo lists. A tool (TodoWrite) whose handler writes a structured snippet into the conversation log and returns it back. The model “sees its own todos” on every turn because it re-reads the log.
- Apparent long-term memory of the conversation. “You said five minutes ago that…” works because that text is sitting up there as a user message, still in scope.
- Subagent results. The subagent runs an entire conversation of its own, then returns a single string. That string becomes one tool_result block in the parent’s message list. The parent has no idea how long the subagent ran or how many tools it called.
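To make the todo-list item concrete, here is a minimal sketch of such a handler. The name todo_write, its argument shape, and the rendering are assumptions for illustration, not Claude Code's actual TodoWrite implementation. The model only ever sees the returned text, which becomes a tool_result in the message list.

```python
# Hypothetical status markers; any stable text rendering would do.
MARKS = {"pending": "[ ]", "in_progress": "[~]", "completed": "[x]"}

def todo_write(todos: list[dict[str, str]]) -> str:
    """Render the full todo list as text; the result lands in the conversation."""
    lines = [f"{MARKS[t['status']]} {t['content']}" for t in todos]
    return "Current todos:\n" + "\n".join(lines)

result = todo_write([
    {"content": "read parser.ts", "status": "completed"},
    {"content": "add parse_config()", "status": "in_progress"},
    {"content": "run tests", "status": "pending"},
])
```

Because the handler takes the whole list each time, the latest tool_result is always a complete snapshot the model can re-read.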
The implication is structural. The harness author’s job, more than anything else, is deciding what goes in the context window and when: which tool outputs pass through verbatim, which get summarized, which old turns are still load-bearing, and which can be dropped. The harness is doing context curation on every turn, and most of the perceived intelligence of the product is a consequence of those choices.
This is also why the context-window sizes the marketing talks about (200k, 1M) matter more than the model’s raw capability for most coding tasks. The window is the entire address space the model can reference. Run out of room and either the harness compacts (lossy), or the session falls over. §7 returns to this.
5. Permissions, sandboxes, and human-in-the-loop
A tool is a function on your machine. Once the model can call it, the model can also call it badly. The harness is what stands between “the model proposed rm -rf ~” and the actual filesystem call.
Three layers, going outward.
Tool gating per call. Before the harness executes a tool_use block, it consults a policy. The policy can auto-allow, auto-deny, or interrupt the user with an approval prompt. Claude Code’s “plan”, “default”, “accept-edits”, and “bypass” modes are progressively looser versions of this gate. The model itself never executes anything: it proposes, the harness decides. Even in “bypass” mode, the gate is still there. It just always answers yes.
Permission policies. Rules that match against tool name and arguments. “Auto-allow git status. Prompt on git push. Deny git push --force to main.” Stored per-project, per-user, or session-only. The policy is a small DSL the harness has to design carefully, because the rules need to be expressive enough to let the agent actually work without becoming a separate full-time configuration job.
Sandboxes. If you do not trust the harness’s gating to be airtight, you put the host process itself in a box. A container, a separate user account, seatbelt or Landlock, a fresh git worktree on a throwaway branch. The sandbox catches the case where the model is calling correctly-gated tools but doing destructive things within the allowed surface, like deleting non-vendored files inside a directory you said it could write to.
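The permission-policy layer can be sketched as a small rule matcher, assuming glob-style patterns over the Bash command string and first-match-wins ordering. The rule format and the decide function are hypothetical, not any particular harness's DSL.

```python
from fnmatch import fnmatchcase

# (tool name, glob over the command, decision). First match wins, so the
# most specific rules go first. Hypothetical format for illustration.
RULES = [
    ("Bash", "git push --force* main*", "deny"),
    ("Bash", "git push*", "prompt"),
    ("Bash", "git status*", "allow"),
]

def decide(tool: str, command: str, default: str = "prompt") -> str:
    for rule_tool, pattern, decision in RULES:
        if tool == rule_tool and fnmatchcase(command, pattern):
            return decision
    return default  # Unknown commands fall back to asking the user.
```

Plain string globbing is easy to sidestep: "git status; rm -rf ~" matches the allow rule above. A real policy would need to parse the command, not just pattern-match it, which is part of why this layer is hard to design.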
The LLM is an untrusted, eager intern with shell access. Every protection layer is downstream of that fact. If you would not give the intern unrestricted bash, do not give the model unrestricted Bash, and a reasonable harness gives you the knobs to say so.
The model cannot do anything between turns. That is the detail that makes the approval prompt a side channel rather than an interrupt. When the harness pauses to ask “run git push?”, it is not stopping a running agent; nothing is running. It is simply waiting on the user. The instant the answer comes back, the harness either runs the tool and feeds the result into the next API call, or feeds back a “user denied” tool_result and lets the model recover. The agent is reactive at its core. The human is the one deciding the cadence.
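Both outcomes fit in one function. In this sketch, decide, ask_user, and run_tool are hypothetical callables the harness would supply. The denied branch returns an ordinary tool_result with is_error set, so the model sees the refusal on its next turn.

```python
from typing import Any, Callable

def gated_call(
    block: dict[str, Any],
    decide: Callable[[dict[str, Any]], str],     # -> "allow" | "deny" | "prompt"
    ask_user: Callable[[dict[str, Any]], bool],  # Blocks until the human answers.
    run_tool: Callable[[dict[str, Any]], str],
) -> dict[str, Any]:
    decision = decide(block)
    if decision == "prompt":
        # Nothing else is running here: the loop is simply waiting.
        decision = "allow" if ask_user(block) else "deny"
    if decision == "allow":
        return {"type": "tool_result", "tool_use_id": block["id"],
                "content": run_tool(block)}
    return {"type": "tool_result", "tool_use_id": block["id"],
            "content": "The user denied this tool call.", "is_error": True}
```

A “bypass” mode under this shape is just a decide that always returns "allow"; the gate is still on the call path.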
6. Subagents are recursion
A subagent is a tool whose handler runs another full instance of the loop, with a fresh conversation, its own system prompt, and a (possibly different) set of tools. The subagent runs to completion on its own and returns a single string to the parent, which receives it as one ordinary tool_result.
The handler is structurally a copy of the main loop, scoped to a fresh message list:

    # AGENT_SYSTEM_PROMPTS, AGENT_TOOLS, and execute_with_permissions are
    # assumed to be defined elsewhere in the harness.
    def task_handler(prompt: str, subagent_type: str) -> str:
        # Fresh conversation: the child sees only this prompt, not the parent's history.
        sub_messages = [{"role": "user", "content": prompt}]
        sub_system = AGENT_SYSTEM_PROMPTS[subagent_type]
        sub_tools = AGENT_TOOLS[subagent_type]
        while True:
            response = client.messages.create(
                model="claude-opus-4-7",
                max_tokens=4096,
                system=sub_system,
                tools=sub_tools,
                messages=sub_messages,
            )
            sub_messages.append({"role": "assistant", "content": response.content})
            # Stop on anything that is not a tool request.
            if response.stop_reason != "tool_use":
                # Only this string crosses back to the parent.
                return "".join(b.text for b in response.content if b.type == "text")
            results = [
                execute_with_permissions(block)
                for block in response.content
                if block.type == "tool_use"
            ]
            sub_messages.append({"role": "user", "content": results})

Inside, it is the same loop from §2. The recursion buys three things:
Context isolation. A code-exploration task can easily fire sixty tool calls and pull two hundred thousand tokens of file content into its context. If the parent only needs a three-paragraph summary of “here is the architecture”, carrying the whole exploration history forward is dead weight. The subagent does the search in its own context window, returns a small string, and the entire exploration log is garbage-collected with the subagent’s stack. The parent’s context stays clean.
Parallelism. Subagents are independent processes from the harness’s perspective. Spawn three at once, await all three, fold the strings back. Useful when the work decomposes: investigate this file while reviewing that one while running the test suite. The cost is real (three concurrent API streams, three loops), but the wall-clock savings can be large.
Specialization. Different system prompts, different tool subsets. A code-review subagent can be configured with read-only tools and a prompt that biases it toward identifying issues over fixing them. An implementer subagent gets write access and a prompt that biases toward shipping. The library of agent types — code-explorer, general-purpose, statusline-setup, and so on — is the harness’s vocabulary for “here is the kind of work I can hand off”.
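The parallel case can be sketched with asyncio. Here run_subagent is a stub standing in for a full agent loop (it sleeps instead of calling an API), so the only thing shown is the spawn, await, and fold-back shape.

```python
import asyncio

async def run_subagent(prompt: str) -> str:
    """Stub for a full agent loop; a real one would make API calls here."""
    await asyncio.sleep(0.01)  # Stands in for network and tool latency.
    return f"summary of: {prompt}"

async def fan_out(prompts: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(run_subagent(p) for p in prompts))

results = asyncio.run(fan_out([
    "investigate src/parser.ts",
    "review src/config.ts",
    "run the test suite",
]))
```

Each returned string would become one tool_result in the parent's message list, in the same order the subagents were requested.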
What subagents do not buy you is shared state. The child sees only what is in its prompt, and the parent sees only the returned string. There is no IPC, no shared variable, nothing like a mutex. If you want the child to know something, you put it in the prompt. If you want the parent to know something, you put it in the response. Anything else is a leaky abstraction.
7. Context engineering: caching, compaction, and the 5-minute TTL
A coding session can run for hours and reach hundreds of thousands of tokens. Re-uploading that conversation on every turn, naively, would be slow and expensive enough to make the product unusable. Harnesses solve this in two structural ways: cache long prefixes on the API side, and compact the conversation when it grows past a threshold.
Prompt caching. When you mark a section of the request with cache_control, the API hashes that prefix and stores the model’s internal state for it, keyed by the hash. Subsequent requests with the same prefix get a cache hit: the server reuses the cached state and only computes the new tokens. The cost difference is roughly an order of magnitude. The latency difference is large enough that you can feel it.
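As a sketch of what that marking looks like with Anthropic's Messages API, the system prompt can be sent as a list of text blocks with cache_control on the last stable block. build_request is a hypothetical helper that only assembles keyword arguments for client.messages.create; the model name is copied from the loop in §2.

```python
from typing import Any

def build_request(system_text: str, tools: list[dict[str, Any]],
                  messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Assemble kwargs for client.messages.create with a cached prefix."""
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "tools": tools,  # Tools precede the system prompt in the cached prefix.
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Everything up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,  # The mutable tail; not marked here.
    }

request = build_request("You are a coding agent.", [], [{"role": "user", "content": "hi"}])
```

A system prompt this short would fall below the API's minimum cacheable length, so treat the example as showing structure only. Changing even one byte of the tools or the system text changes the prefix and forfeits the hit.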
By default, the Anthropic prompt cache currently has a five-minute TTL, refreshed each time the cached prefix is hit. That number sounds like a footnote and is actually a load-bearing constraint on how harnesses are designed. It implies two practical rules:
- Keep the long-lived parts of the request stable across turns. System prompt, tool declarations, project memory, the early messages of the conversation: same bytes every time, in the same order. Mutate only the tail.
- If your harness does any kind of background work, time it against the TTL. Sleep for four minutes and you stay warm. Sleep for six and you eat a cache miss. The dead zone right around five minutes is the worst of both: you pay the miss without amortizing it. Claude Code’s own scheduling helpers spell this out explicitly, recommending either sub-270-second polls or 1200-second-plus long waits.
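The second rule can be expressed as a small helper that moves a requested wait out of the dead zone. The 270-second and 1200-second bounds are the figures quoted above; the function name and the snap-to-nearest policy are assumptions.

```python
WARM_MAX_S = 270    # Longest wait that still lands inside the cache TTL.
COLD_MIN_S = 1200   # Shortest wait worth paying a cache miss for.

def snap_wait(requested_s: float) -> float:
    """Move a wait out of the (270, 1200) dead zone to the nearer bound."""
    if requested_s <= WARM_MAX_S or requested_s >= COLD_MIN_S:
        return requested_s
    midpoint = (WARM_MAX_S + COLD_MIN_S) / 2
    return WARM_MAX_S if requested_s < midpoint else COLD_MIN_S
```

For example, a requested 300-second sleep is pulled down to 270 so the next call still hits the cache, while a 1000-second sleep is pushed up to 1200 so the unavoidable miss is at least amortized over a longer wait.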
Compaction. When the conversation crosses some token threshold (Claude Code triggers it at a percentage of the context window), the harness asks the model to summarize the older portion of the conversation and replaces it with the summary in place. Old tool results, off-topic exchanges and stale exploration get compressed into a few paragraphs. The summary is just text in the message list, marked as such, followed by the most recent N turns intact.
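A simplified sketch of that step follows. The token estimate (four characters per token), the threshold, and the summarize callable are all assumptions. A real harness would use the API's token counts, ask the model for the summary, and take care not to split a tool_use from its tool_result at the cut point.

```python
from typing import Callable

Message = dict[str, str]

def estimate_tokens(messages: list[Message]) -> int:
    # Crude heuristic (~4 characters per token); an assumption, not a tokenizer.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages: list[Message], summarize: Callable[[list[Message]], str],
            threshold: int = 1000, keep_last: int = 4) -> list[Message]:
    """Replace everything but the last keep_last messages with one summary."""
    if estimate_tokens(messages) <= threshold or len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "user",
               "content": "[Summary of earlier conversation]\n" + summarize(old)}
    return [summary] + recent
```

Whatever summarize leaves out is gone for the rest of the session, which is where the lossiness becomes visible.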
Compaction is lossy by definition, and the lossiness shows up in concrete ways. A harness that compacts aggressively can lose the file path it was about to edit, the variable name it was tracking, the exact error message it was debugging. A harness that doesn’t compact at all eventually hits the context limit and falls over mid-tool-call. The tradeoff between aggression and fidelity is a delicate engineering choice, and it usually shows up in user-facing behavior before it shows up in any settings panel.
The combination is what makes a session feel continuous past the model’s nominal context limit. The user perceives “Claude remembered what we were doing two hours ago”. The reality is a compacted summary of the early session, a tail of recent turns intact, and a cache hit on the front of the request. The illusion of memory is a stack of small, deterministic choices about which bytes to keep, which to compress, and which to pin.
8. What makes a harness good
The loop is trivial. The thirty lines in §2 give you a working agent that reads files, runs commands, and edits code. What makes Claude Code or Codex CLI a better product is everything around the loop.
What moves the needle:
Tool descriptions. The highest-impact knob in the whole system. A tool with a fifty-word description gets called erratically: too often, too rarely, with wrong arguments. The same tool with three hundred well-chosen words about when to use it, what to avoid, and how to format inputs gets called well. Claude Code’s Bash description runs to several hundred lines, much of it about which dedicated tool to prefer instead. That is not bureaucracy. It is the documented behavior of the product encoded where the model can actually see it.
[[3-Topics/AI & ML/A Guide to Effective LLM Prompting/Index|System prompt]]. Where the harness teaches the model how to behave in this particular product. Voice, output format, when to clarify versus proceed, which tools to prefer, when to use subagents. “Don’t add comments unless asked”, “after editing a UI component, take a screenshot in the browser”, “check Read before Edit” all live here. The system prompt is the product’s personality, written in English.
Permission UX. A harness that asks the wrong questions makes scary things feel safe and safe things feel exhausting. Good permission UX is mostly about the smallest interruption that still gives the user real control: scope (allow once, allow for the session, allow always), match precision (this exact command versus this command family), and recoverability (denied tool calls should let the model adapt, not crash).
Subagent affordances. A library of pre-defined agent types, each with a curated system prompt and tool subset, gives the parent a vocabulary for “here is a unit of work I can hand off”. A harness with one generic subagent and a harness with twelve specialized ones are different products even though the underlying recursion is identical.
Context engineering. Cache anchoring, compaction strategy, what goes in the system prompt versus the first user message, when to inject git status, when to refresh project memory. Invisible until it isn’t, and then you notice the session getting confused, expensive, or slow.
Tool quality. Edit refusing to operate on a file the model has not read prevents a whole class of error. Read returning line-numbered output makes Edit more reliable. Each tool is a small piece of interface design between the model and the host machine, and the cumulative quality of those interfaces is what makes the harness feel “smart”.
The model. None of the above matters if the model can’t follow instructions, can’t plan over multiple steps, or can’t reason about its own tool calls. A great harness on a weak model is a frustrating product. The same harness on a strong model is the thing you keep open all day.
Coding harnesses aren’t simple, but they aren’t magic either. The surface area you imagined as one indivisible magical thing decomposes into a stack of small, comprehensible engineering choices. The loop is the boring part. The interface design around the loop is where the work goes, and where the differences between products live.