The agent as compiler

Harness Engineering, the First Frontier

A horse harness is an engineered system: collar, traces, and reins, all tuned to turn raw power into directed work. The discipline is thousands of years old.

Nirant Kasliwal
Scaled Focus · nirantk.com · @nirantk
An engraving of a horse harness: collar, traces, and reins
The original harness
01 / 25
I'm Nirant.
Scaled Focus
We help agent companies build better agent harnesses.
2026
Helped Kavana cut costs 40% with caching, at 1M chats a day. LiteLLM contributors.
2025
Helped Ragas modernize. Ran search evals with Littlebird.ai.
Now
Optimizing a presentations harness to 10 cents a deck.
Earlier
Author, "NLP in Python", 5,000+ copies
Built FastEmbed, used by NVIDIA NeMo Guardrails and 3,000+ repos
Top 5 GenAI Scientists in India, 2023
Nirant presenting NLP for Indic Languages at Wingify DevFest 2018
NLP for Indic Languages, Wingify DevFest 2018
02 / 25
Today.
Two ideas, and the loops
01
What is a harness?
The code around the model, and why it is the work.
02
Coding agents are compilers.
Why that framing pays off, and what it tells you to build.
03
The loops.
The cycles that run through most harnesses.
03 / 25
What is a harness?
The fourth resource
A harness reduces the time from when you have a desire to when you have it. It manages a new resource: intelligence.
01
Network
Move the bytes.
02
Compute
Run the work.
03
Storage
Keep the state.
04 · NEW
Intelligence
Decide what to do.
For decades, backend engineering meant managing three resources: network, compute, storage. Intelligence is the fourth. It is metered, it scales and throttles, and it fails in new ways. The harness is how you manage it.
04 / 25
The shape was always the same.
ML debt, 2014  ↔  harness, today
ML technical debt, Sculley et al. 2014: a tiny ML Code box surrounded by infrastructure
Harness engineering: a tiny LLM box surrounded by harness layers
05 / 25

The model is the smallest box, and it should keep getting smaller. Everything that matters is the harness around it. Model + harness = agent.

The bridge
06 / 25

Coding Agents
are Compilers.

An agent compiles human instructions into machine code.
07 / 25
One machine, two names.
Compiler ↔ coding agent
Figure: two parallel tracks. Compiler: Source, High IR, Med IR, LLVM IR, Machine code. Agent: Context, Plan and explore, Generate, Output and run, Verify. LLMs are an "IR".
Coding agents are compilers of human instructions into machine code.
The LLM is the IR. An IR is never the product. You run passes over it, then you check what comes out.
08 / 25
Figure: Coding Agents are Compilers of Human Instructions to Machine Code. Compiler track: Source, High IR, Med IR, LLVM IR, Machine Code (tokenization and parsing, type checking, optimizations). Coding agent track: message and context, Plan and Explore (planning), Generate, Output and Execute, Verify (sandbox + reward). LLMs are an "IR".
From my notes

A code harness runs two layers.

A model, and ordinary code: Python, Rust, famously bash. Line the agent up against a compiler and they map stage for stage. The weights are the middle IR.

All models learn the same things: Universal Geometry of Embeddings · arxiv.org/abs/2505.12540
09 / 25
A code harness is extremely reusable.
One engine, four products
Claude Agent SDK: held constant. Config (skills, tools, prompt): the only thing that changes.
Claude Code · skills: code, refactor · tools: shell, files, git · prompt: pair-programmer
Claude Design · skills: layout, visual · tools: canvas, components · prompt: designer
Claude in Chrome · skills: web tasks · tools: browser, DOM · prompt: operator
Claude Cowork · skills: office, comms · tools: docs, calendar · prompt: teammate
One engine, four products. Only the configuration changes.
Same compiler, different passes. Skills, tools, and the system prompt are the variables. The SDK underneath is held constant.
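A minimal sketch of "same engine, different config", using the skills, tools, and prompt values from the slide. `HarnessConfig` and `run_engine` are hypothetical names for illustration; this is not the Claude Agent SDK API, and the engine here is a stand-in that only formats a string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """The only thing that changes between products (hypothetical type)."""
    skills: tuple[str, ...]
    tools: tuple[str, ...]
    prompt: str


def run_engine(config: HarnessConfig, task: str) -> str:
    """Stand-in for the shared engine: same code path for every product."""
    return f"[{config.prompt}] {task} using tools: {', '.join(config.tools)}"


# Four products, one engine. Values are taken from the slide.
PRODUCTS = {
    "Claude Code": HarnessConfig(("code", "refactor"), ("shell", "files", "git"), "pair-programmer"),
    "Claude Design": HarnessConfig(("layout", "visual"), ("canvas", "components"), "designer"),
    "Claude in Chrome": HarnessConfig(("web tasks",), ("browser", "DOM"), "operator"),
    "Claude Cowork": HarnessConfig(("office", "comms"), ("docs", "calendar"), "teammate"),
}

for name, config in PRODUCTS.items():
    print(name, "->", run_engine(config, "do the task"))
```

The design point is that `run_engine` never branches on the product name. Adding a fifth product means adding a dictionary entry, not touching the engine.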
10 / 25
A skill is a dynamic edge.
Compilers are DAGs
A compiler is a directed graph of passes, a fixed workflow. A coding harness runs the model and the real language compiler in one tight loop. A skill adds an edge to that graph at runtime: a registered tool, disclosed only when the model needs it, saved as markdown and code.
Dynamic edge
Use a skill.
The model adds the edge only when it needs it. Progressive disclosure keeps it cheap until the skill fires, so one harness covers many kinds of work.
Diagram: in, run, out, with a skill edge added on demand.
General. Worth it when the work varies.
Static edge
Make it native.
Wire the edge into the harness as plain code, always on. The model makes no choice, so the same input runs the same way every time.
Diagram: in, run, native, out.
Reliable. Worth it for a fixed workload.
For a specific workload, what a skill does is often better as a native harness property. You trade a skill's dynamic generalization for deterministic reliability.
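The trade above can be sketched in a few lines. Everything here is an assumption for illustration: `Skill`, `pick_skill`, `run_dynamic`, and `run_native` are hypothetical names, and the keyword match stands in for the model's decision to use a skill.

```python
from dataclasses import dataclass
from typing import Callable, Optional

loaded: list[str] = []  # records which skill bodies were actually loaded


@dataclass
class Skill:
    name: str
    description: str                      # short line the model always sees
    load_instructions: Callable[[], str]  # full body, read only when the skill fires
    run: Callable[[str], str]


def _load_slug_instructions() -> str:
    loaded.append("slugify")
    return "Lowercase the title and join words with hyphens."


SLUGIFY = Skill(
    name="slugify",
    description="Turn a title into a URL slug.",
    load_instructions=_load_slug_instructions,
    run=lambda text: "-".join(text.lower().split()),
)


def pick_skill(task: str, skills: list[Skill]) -> Optional[Skill]:
    """Stand-in for the model's choice: a keyword match on the task."""
    for skill in skills:
        if skill.name in task:
            return skill
    return None


def run_dynamic(task: str, payload: str, skills: list[Skill]) -> str:
    """Dynamic edge: the step exists only if the skill is chosen."""
    skill = pick_skill(task, skills)
    if skill is None:
        return payload          # edge never added, body never loaded
    skill.load_instructions()   # progressive disclosure: pay only now
    return skill.run(payload)


def run_native(payload: str) -> str:
    """Static edge: the same step wired in as plain code, always on."""
    return "-".join(payload.lower().split())


print(run_dynamic("summarize this", "Hello World", [SLUGIFY]))
print(run_dynamic("slugify the title", "Hello World", [SLUGIFY]))
print(run_native("Hello World"))
```

`run_dynamic` covers tasks that need no slug at all, at the price of a choice that can go wrong. `run_native` makes no choice, so the same input always takes the same path.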
11 / 25

Loops.

The cycles that run every harness.
12 / 25
Common harness loops.
Each one, what it does
01.

Agent loop

the model calls tools until done

The model calls tools until the task is done. It is the tightest cycle, and everyone already ships it.

e.g. /goal · /loop
02.

Verification loop

Diagram: generate, grade, and back.

a grader scores, then feeds back

A grader scores the output against a rubric. If it falls short, it goes back with feedback. The grader can be code or another model.

in test suites, LLM-as-judge
03.

Event-driven loop

agent

a trigger from outside fires it

An event fires and the agent runs on its own. Now it is a component inside a larger system, not a tool you poke.

in Claude Code's dreaming, agent memory
04.

Hill-climbing loop

it points at the passes, between runs

It does not return to a node. It points at the passes: an analysis agent reads the traces and rewrites the harness itself.

in Karpathy's autoresearch, EVO, Meta Harness
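Loops 01 and 02 nest naturally, so a rough sketch can show both at once. The model, the tool, and the grader below are scripted stand-ins, not real APIs; `agent_loop` and `verification_loop` are hypothetical names, and the action dictionaries are an assumed format.

```python
def agent_loop(model, tools, task, max_steps=8):
    """Loop 01: ask the model for an action, run the tool, repeat until it answers."""
    history = [("user", task)]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":
            return action["content"]
        result = tools[action["tool"]](action["input"])
        history.append(("tool", result))
    raise RuntimeError("step budget exhausted before the model finished")


def verification_loop(generate, grade, task, max_rounds=3):
    """Loop 02: generate, grade against a rubric, feed the critique back."""
    feedback = ""
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = generate(task, feedback)
        passed, feedback = grade(output)
        if passed:
            return output, True, round_no
    return output, False, max_rounds


def scripted_model(history):
    """Stand-in model: call one tool, then answer with its result."""
    role, content = history[-1]
    if role == "user":
        return {"type": "tool", "tool": "word_count", "input": content}
    return {"type": "final", "content": f"{content} words"}


TOOLS = {"word_count": lambda text: str(len(text.split()))}


def generate(task, feedback):
    answer = agent_loop(scripted_model, TOOLS, task)
    # A real generator would put the feedback into the prompt.
    # This stand-in just applies the one fix the grader asks for.
    return (answer + ".") if feedback else answer


def grade(output):
    """Code grader with a one-line rubric: the answer must end with a period."""
    if output.endswith("."):
        return True, ""
    return False, "end the answer with a period"


result = verification_loop(generate, grade, "count these four words")
print(result)
```

The grader here is plain code. Swapping it for another model changes `grade` only; the loop around it stays the same.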
13 / 25
Auto harness.
Run the work, then improve the harness
Automate work · the goal
What a goal adds to a thread: a goal holds objective, status, budget and usage as durable, thread-scoped state
Automate improvement · meta-harness
The meta-harness search loop: an agent reads a filesystem of prior candidates, traces and scores, proposes a new harness, it is evaluated on held-out tasks, logs are stored, and the loop repeats
Auto harness: a smaller model that synthesizes a custom harness, or the whole policy, can outperform a much larger model, and at lower cost. arxiv.org/abs/2603.03329
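A toy version of the search loop described above, under loud assumptions: the "harness" is a single hypothetical knob (`retries`), each held-out task is a number standing for how many retries it needs, and an in-memory list stands in for the filesystem of prior candidates and scores. It is a sketch of the loop shape, not the method from the cited paper.

```python
def evaluate(harness, tasks):
    """Score a candidate harness on held-out tasks (toy: solved if retries suffice)."""
    solved = sum(1 for difficulty in tasks if difficulty <= harness["retries"])
    return solved / len(tasks)


def propose(log):
    """Stand-in for the analysis agent: read prior candidates, tweak the best one."""
    best = max(log, key=lambda entry: entry["score"])
    return {"retries": best["harness"]["retries"] + 1}


def hill_climb(seed, tasks, iterations):
    """Propose, evaluate, store the log, repeat. Returns the best entry and the full log."""
    log = [{"harness": seed, "score": evaluate(seed, tasks)}]
    for _ in range(iterations):
        candidate = propose(log)
        log.append({"harness": candidate, "score": evaluate(candidate, tasks)})
    return max(log, key=lambda entry: entry["score"]), log


HELD_OUT = [0, 1, 1, 2, 3]  # hypothetical task difficulties
best, log = hill_climb({"retries": 0}, HELD_OUT, iterations=4)
print(best)
print([entry["score"] for entry in log])
```

Note what the loop edits: the harness, never the model. The score climbs from run to run while the thing being scored stays outside the agent loop.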
14 / 25
Frontier minus one is enough.
In the wild · @anshuc
We can run loops, but against what targets?
Strong UI from GLM-5.2, a tier below the frontier. Opus competes with Fable; GLM-5.2 does what Opus can. With the right harness, last season's model is enough, and the model becomes a commodity. Clip: @anshuc, GLM-5.2.
15 / 25

Reward is
All You Need.

Own the loop. Commoditize the model.
16 / 25
Reward and verification set the output shape.
The verifier sets the cap
Diagram: model output space (near-infinite variety), VERIFIER, reliable output.
Requisite variety
The slit width is your reliability ceiling.
An LLM's output variety is effectively unbounded. Reward and verification are the slit it passes through, so they set the shape of what comes out. Widen the slit and the ceiling lifts; narrow it and you just automate being wrong, faster.
17 / 25
Two kinds of reward.
Output vs trajectory
In RL terms, outcome rewards versus process rewards. One scores the final result, the other scores the path it took. The output signal is sparse; the trajectory signal is dense.
Score the result
Output-based.
Reward the final result only. Sparse, but it lets you search for the cheapest, most direct trajectory that still lands. A domain expert owns it: they catch the details, and know how a person would do the task.
Diagram: in, out, reward.
Optimizes cost and directness.
Score the path
Trajectory-based.
Reward each step of the path. A denser, richer signal that catches where the agent drifts from what the user wanted. A dedicated engineer owns it, reading the agent's trajectory.
Diagram: in, out, reward at every step.
Catches deviation from intent.
Most real systems need both: the output check from a domain expert, the trajectory eval from a dedicated engineer. Process reward model survey: arxiv.org/abs/2510.08049.
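The two rewards can be put side by side on the same pair of runs. This is a sketch with invented data: `Step`, its `on_intent` label (imagined as coming from an engineer reading the trace), and both trajectories are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    cost: float      # e.g. dollars or tokens spent on this step
    on_intent: bool  # hypothetical label from an engineer reading the trace


def output_reward(final_output: str, expected: str) -> float:
    """Sparse: one score for the final result."""
    return 1.0 if final_output == expected else 0.0


def trajectory_reward(steps: list[Step]) -> list[float]:
    """Dense: one score per step of the path."""
    return [1.0 if step.on_intent else 0.0 for step in steps]


def total_cost(steps: list[Step]) -> float:
    return sum(step.cost for step in steps)


# Two runs that land on the same output by different paths.
direct = [Step("read spec", 1.0, True), Step("write page", 2.0, True)]
wandering = [
    Step("read spec", 1.0, True),
    Step("rewrite unrelated file", 3.0, False),
    Step("write page", 2.0, True),
]

for name, steps in [("direct", direct), ("wandering", wandering)]:
    print(name, output_reward("page.html", "page.html"), total_cost(steps), trajectory_reward(steps))
```

The output reward cannot tell the two runs apart, so among passing runs you pick on cost. The trajectory reward pinpoints the step where the second run drifted.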
18 / 25
Environments are the flywheel.
Diverse traces improve the general model
Specialize a general model into a harness for each environment.
Each one brings its own config and skills, runs, and leaves traces. Those traces improve both the specialized model and the general one.
The general model gains most from diversity, pulling more tasks into its distribution and interpolating the rest.
Diagram: a general model (interpolates) is specialized with config + skills into Coding, Support, Research, and Ops harnesses; their diverse traces flow back to the general model.
19 / 25
From my own work · x.com/NirantK
Project Report: Claude Agent SDK with Skills, for Landing Page Generation
Before we swap models, one example of my own. It is not an elaborate harness, just one small loop, generate then check, run well. That alone already carries the result, which sets up the next slide: if the loop carries the quality, how much does the model underneath still matter?
20 / 25
The model is interchangeable.
Same harness, five models
Five mobile landing pages built by Sonnet 4.6, Kimi 2.6, GLM-5, GPT-5 and Minimax-2.7 from one harness with per-model skills
One harness. Five models, each skill-tuned to its strengths. All five ship.
Swap the model and the output holds, because the reward signal and the harness carry the quality. The model is the commodity.
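A small sketch of why the swap can hold, assuming the harness owns the check and the retry. The two "models" are scripted stand-ins with invented names, not any of the five above, and `make_harness` is a hypothetical helper.

```python
def make_harness(check, max_attempts=3):
    """Build a harness that owns the reward: any callable model can plug in."""
    def run(model, skill_prompt, task):
        prompt = f"{skill_prompt}\n{task}"
        for _ in range(max_attempts):
            output = model(prompt)
            if check(output):
                return output
            # Feed the failure back so the next attempt can correct it.
            prompt += f"\nPrevious attempt failed the check: {output!r}"
        return None  # nothing ships without passing the check
    return run


def is_complete_page(output):
    """Toy verifier: the page must open and close its html tag."""
    return output.startswith("<html>") and output.endswith("</html>")


def careful_model(prompt):
    """Stand-in model that gets it right the first time."""
    return "<html>landing page</html>"


def sloppy_model(prompt):
    """Stand-in model that only closes the tag after being told it failed."""
    if "failed the check" in prompt:
        return "<html>landing page</html>"
    return "<html>landing page"


harness = make_harness(is_complete_page)
MODELS = {
    "careful": (careful_model, "Be concise."),           # per-model skill prompt
    "sloppy": (sloppy_model, "Always close your tags."),
}
shipped = {name: harness(model, skill, "Build a landing page") for name, (model, skill) in MODELS.items()}
print(shipped)
```

Both stand-ins ship the same page because the check and the feedback live in the harness. The per-model skill prompt is the only model-specific part.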
21 / 25
The test ahead

A company should be able to switch out a 'generalist' model without losing the 'company veteran' expertise built into their learning system. This is the key 'test' of your control and sovereignty in the era ahead.

Satya Nadella, Chairman & CEO, Microsoft · x.com/satyanadella/status/2066182223213293753
22 / 25
Your job.
The engineer in the loop
Keep the loop running.
Auto-harness, meta-harness, auto-research, whatever its name. Make it run on problems worth solving, often, and correctly. Then define and measure everything else users expect.
We spent decades building software on short text and clicks. Now we can process language. Great power, and the responsibility to nail it.
Diagram: Run, Trace, Analyze, Improve, and back to Run (auto-improve). Score against reward, then rewrite the harness.
23 / 25
Three takeaways.
Switch the model, keep the context
A Claude interface at 90 percent of the session limit, with the prompt: create a summary so that GPT can understand it clearly
01
Agents compile human input.
The compiler framing tells you where the real work is.
02
The loops that matter sit outside the agent.
Conversion for SaaS, the merge loop for coding. That is loopcraft.
03
Reward makes the model interchangeable.
Find the reward one loop up, then feed it back. Swap the model, and the output holds.
24 / 25
Thank you

Own the loop.

Reward is all you need.
Nirant Kasliwal
Scaled Focus · nirantk.com · @nirantk
25 / 25