The agent as compiler

Harness Engineering, the First Frontier

A horse harness is an engineered system: collar, traces, and reins, all tuned to turn raw power into directed work. The discipline is thousands of years old.

Nirant Kasliwal

Scaled Focus · nirantk.com · @nirantk

An engraving of a horse harness: collar, traces, and reins

The original harness

01 / 25

We help agent companies build better agent harnesses.

2026

Helped Kavana cut costs 40% with caching, at 1M chats a day. LiteLLM contributors.

2025

Helped Ragas modernize. Ran search evals with Littlebird.ai.

Now

Optimizing a presentations harness to 10 cents a deck.

Earlier

Author, "NLP in Python", 5,000+ copies

Built FastEmbed, used by NVIDIA Nemo Guardrails and 3,000+ repos

Top 5 GenAI Scientists in India, 2023

Nirant presenting NLP for Indic Languages at Wingify DevFest 2018

NLP for Indic Languages, Wingify DevFest 2018

02 / 25

01

What is a harness?

The code around the model, and why it is the work.

02

Coding agents are compilers.

Why that framing pays off, and what it tells you to build.

03

The loops.

The cycles that run through most harnesses.

03 / 25

A harness shortens the time between wanting something and having it. It manages a new resource: intelligence.

01

Network

Move the bytes.

02

Compute

Run the work.

03

Storage

Keep the state.

04 · NEW

Intelligence

Decide what to do.

For decades, backend engineering meant managing three resources: network, compute, storage. Intelligence is the fourth. It is metered, it scales and throttles, and it fails in new ways. The harness is how you manage it.
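To make "metered" and "throttles" concrete, here is a minimal sketch of a token meter a harness might keep. The class name `IntelligenceMeter`, the `BudgetExceeded` exception, and the 1,000-token limit are all assumptions for illustration, not part of any SDK.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a call would push usage past the configured limit."""


@dataclass
class IntelligenceMeter:
    # Hypothetical limit: total tokens this harness may spend.
    token_limit: int
    tokens_used: int = 0

    @property
    def remaining(self) -> int:
        return self.token_limit - self.tokens_used

    def charge(self, tokens: int) -> int:
        """Record usage and return the tokens still available."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.tokens_used + tokens > self.token_limit:
            raise BudgetExceeded(f"need {tokens}, only {self.remaining} left")
        self.tokens_used += tokens
        return self.remaining


meter = IntelligenceMeter(token_limit=1_000)
meter.charge(400)
meter.charge(500)
try:
    meter.charge(200)  # would exceed the limit, so the harness throttles here
    throttled = False
except BudgetExceeded:
    throttled = True
```

A real harness would likely meter cost and latency as well; tokens are just the simplest unit to show.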

04 / 25

ML technical debt, Sculley et al. 2015: a tiny ML Code box surrounded by infrastructure

Harness engineering: a tiny LLM box surrounded by harness layers

05 / 25

The model is the smallest box, and it should keep getting smaller. Everything that matters is the harness around it. Model + harness = agent.

The bridge

06 / 25

Coding Agents
are Compilers.

An agent compiles human instructions into machine code.

07 / 25

Coding agents are compilers of human instructions into machine code.

The LLM is the IR. An IR is never the product. You run passes over it, then you check what comes out.

08 / 25

From my notes

A code harness runs two layers.

A model, and ordinary code: Python, Rust, famously bash. Line the agent up against a compiler and they map stage for stage. The weights are the middle IR.

All models learn the same things: Universal Geometry of Embeddings · arxiv.org/abs/2505.12540

09 / 25

One engine, four products. Only the configuration changes.

Same compiler, different passes. Skills, tools, and the system prompt are the variables. The SDK underneath is held constant.
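One way to picture "same compiler, different passes" is an engine function that never changes and a config object that does. This is a sketch only: `HarnessConfig`, `run_engine`, the two example configs, and the `echo_model` stand-in are hypothetical names, not the API of any agent SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HarnessConfig:
    # The variables named on the slide: system prompt, tools, skills.
    system_prompt: str
    tools: tuple[str, ...] = ()
    skills: tuple[str, ...] = ()


def run_engine(config: HarnessConfig, task: str, model: Callable[[str], str]) -> str:
    """The constant part: assemble the prompt from config, then call the model."""
    prompt = "\n".join(
        [
            config.system_prompt,
            "tools: " + ", ".join(config.tools),
            "skills: " + ", ".join(config.skills),
            "task: " + task,
        ]
    )
    return model(prompt)


def echo_model(prompt: str) -> str:
    # Stand-in for a real model call: returns the first line it was given.
    return prompt.splitlines()[0]


# Two illustrative products built on the same engine.
coding = HarnessConfig("You write code.", tools=("bash",), skills=("refactor",))
slides = HarnessConfig("You build decks.", tools=("render",), skills=("layout",))

coding_out = run_engine(coding, "fix the bug", echo_model)
slides_out = run_engine(slides, "pitch deck", echo_model)
```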

10 / 25

A compiler is a directed graph of passes, a fixed workflow. A coding harness runs the model and the real language compiler in one tight loop. A skill adds an edge to that graph at runtime: a registered tool, disclosed only when the model needs it, saved as markdown and code.

Dynamic edge

Use a skill.

The model adds the edge only when it needs it. Progressive disclosure keeps it cheap until the skill fires, so one harness covers many kinds of work.

General. Worth it when the work varies.

Static edge

Make it native.

Wire the edge into the harness as plain code, always on. The model makes no choice, so the same input runs the same way every time.

Reliable. Worth it for a fixed workload.

For a specific workload, what a skill does is often better as a native harness property. You trade a skill's dynamic generalization for deterministic reliability.
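The trade between a static and a dynamic edge can be sketched in a few lines. Everything here is illustrative: the pass and skill names are made up, and the list `chosen_skills` stands in for a choice that the model would make at runtime.

```python
from typing import Callable

Pass = Callable[[str], str]


def strip_whitespace(text: str) -> str:
    return text.strip()


def uppercase_skill(text: str) -> str:
    return text.upper()


# Static edges: wired into the harness, always on, no model choice involved.
STATIC_PASSES: list[Pass] = [strip_whitespace]

# Dynamic edges: only the one-line description is shown up front
# (progressive disclosure); the body is used only if the skill is chosen.
SKILLS: dict[str, tuple[str, Pass]] = {
    "shout": ("Upper-case the text.", uppercase_skill),
}


def skill_menu() -> list[str]:
    """The cheap part the model sees before any skill fires."""
    return [f"{name}: {desc}" for name, (desc, _) in SKILLS.items()]


def run(text: str, chosen_skills: list[str]) -> str:
    for static_pass in STATIC_PASSES:
        text = static_pass(text)
    for name in chosen_skills:  # in a real harness the model picks these
        _, skill = SKILLS[name]
        text = skill(text)
    return text
```

With no skills chosen, `run` behaves the same on every call; that is the reliability the static edge buys. Moving `uppercase_skill` into `STATIC_PASSES` would be "making it native".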

11 / 25

Loops.

The cycles that run every harness.

12 / 25

01.

Agent loop

the model calls tools until done

The model calls tools until the task is done. It is the tightest cycle, and everyone already ships it.

e.g. /goal · /loop

02.

Verification loop

a grader scores, then feeds back

A grader scores the output against a rubric. If it falls short, it goes back with feedback. The grader can be code or another model.

in test suites, LLM-as-judge

03.

Event-driven loop

a trigger from outside fires it

An event fires and the agent runs on its own. Now it is a component inside a larger system, not a tool you poke.

in Claude Code's dreaming, agent memory

04.

Hill-climbing loop

it points at the passes, between runs

It does not return to a node. It points at the passes: an analysis agent reads the traces and rewrites the harness itself.

in Karpathy's autoresearch, EVO, Meta Harness
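The first two loops above are small enough to sketch directly. This is a toy under stated assumptions: `step`, `generate`, and `grade` are placeholder callables standing in for a model and a grader, and the step and round caps are arbitrary.

```python
from typing import Callable


def agent_loop(step: Callable[[list[str]], str], max_steps: int = 10) -> list[str]:
    """Agent loop: call `step` until it says 'done' or the step cap is hit."""
    history: list[str] = []
    for _ in range(max_steps):
        action = step(history)
        if action == "done":
            break
        history.append(action)
    return history


def verification_loop(
    generate: Callable[[str], str],
    grade: Callable[[str], tuple[float, str]],
    threshold: float,
    max_rounds: int = 3,
) -> tuple[str, float]:
    """Verification loop: grade the output, feed the feedback back in."""
    feedback, output, score = "", "", 0.0
    for _ in range(max_rounds):
        output = generate(feedback)
        score, feedback = grade(output)
        if score >= threshold:
            break
    return output, score


# Toy stand-ins so the sketch runs end to end.
def toy_step(history: list[str]) -> str:
    return "done" if len(history) >= 2 else f"tool_call_{len(history)}"


def toy_generate(feedback: str) -> str:
    return "draft with tests" if "tests" in feedback else "draft"


def toy_grade(output: str) -> tuple[float, str]:
    return (1.0, "") if "tests" in output else (0.0, "add tests")


calls = agent_loop(toy_step)
final, final_score = verification_loop(toy_generate, toy_grade, threshold=1.0)
```

The grader here is plain code; swapping `toy_grade` for a model call gives the LLM-as-judge variant without changing the loop.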

13 / 25

What a goal adds to a thread: a goal holds objective, status, budget and usage as durable, thread-scoped state — Automate work · the goal
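As a rough sketch of what "durable, thread-scoped state" could look like, here is a goal record that survives a round trip through JSON. The four fields come from the caption; the field types, the status values, and the `Goal` class itself are assumptions, not the actual schema of any product.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Goal:
    thread_id: str            # the scope: one goal per thread
    objective: str
    status: str = "active"    # assumed values: "active", "done", "over_budget"
    budget_tokens: int = 0
    used_tokens: int = 0

    def record(self, tokens: int) -> None:
        """Add usage and flip the status once the budget is exceeded."""
        self.used_tokens += tokens
        if self.used_tokens > self.budget_tokens:
            self.status = "over_budget"

    def dumps(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, raw: str) -> "Goal":
        return cls(**json.loads(raw))


goal = Goal(thread_id="t1", objective="ship the fix", budget_tokens=100)
goal.record(60)
restored = Goal.loads(goal.dumps())  # state survives a restart
restored.record(50)
```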

The meta-harness search loop: an agent reads a filesystem of prior candidates, traces, and scores, then proposes a new harness; the harness is evaluated on held-out tasks, the logs are stored, and the loop repeats. Automate improvement · meta-harness

Auto harness: a smaller model that synthesizes a custom harness, or the whole policy, can outperform a much larger model, and at lower cost. arxiv.org/abs/2603.03329
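The meta-harness search loop described above reduces to propose, evaluate, log, repeat. The sketch below is a toy: an in-memory list stands in for the filesystem of candidates and traces, a single `max_retries` knob stands in for a whole harness, and the scoring function is invented so the example is deterministic.

```python
from typing import Callable


def search_harness(
    propose: Callable[[list[dict]], dict],
    evaluate: Callable[[dict], float],
    rounds: int,
) -> dict:
    """Propose a candidate, score it, log it, repeat; return the best record."""
    log: list[dict] = []  # stands in for stored candidates, traces, and scores
    for _ in range(rounds):
        candidate = propose(log)
        score = evaluate(candidate)
        log.append({"candidate": candidate, "score": score})
    return max(log, key=lambda rec: rec["score"])


def toy_propose(log: list[dict]) -> dict:
    # A real proposer would be an agent reading traces; this one nudges the best so far.
    if not log:
        return {"max_retries": 0}
    best = max(log, key=lambda rec: rec["score"])["candidate"]
    return {"max_retries": best["max_retries"] + 1}


def toy_evaluate(candidate: dict) -> float:
    # Invented held-out score that peaks at max_retries == 3.
    return float(-abs(candidate["max_retries"] - 3))


best = search_harness(toy_propose, toy_evaluate, rounds=6)
```

The part worth keeping from the toy is the shape: the proposer only ever sees the log, so everything the search learns has to be written down between runs.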

14 / 25

We can run loops, but against what targets?

Strong UI from GLM-5.2, a tier below the frontier. Opus competes with Fable; GLM-5.2 does what Opus can. With the right harness, last season's model is enough, and the model becomes a commodity. Clip: @anshuc, GLM-5.2.

15 / 25

Reward is
All You Need.

Own the loop. Commoditize the model.

16 / 25

Requisite variety

The slit width is your reliability ceiling.

An LLM's output variety is effectively unbounded. Reward and verification are the slit it passes through, so they set the shape of what comes out. Widen the slit and the ceiling lifts; narrow it and you just automate being wrong, faster.

17 / 25

In RL terms, this is outcome rewards versus process rewards. One scores the final result, the other scores the path it took. The output signal is sparse; the trajectory signal is dense.

Score the result

Output-based.

Reward the final result only. Sparse, but it lets you search for the cheapest, most direct trajectory that still lands. A domain expert owns it: they catch the details, and know how a person would do the task.

Optimizes cost and directness.

Score the path

Trajectory-based.

Reward each step of the path. A denser, richer signal that catches where the agent drifts from what the user wanted. A dedicated engineer owns it, reading the agent's trajectory.

Catches deviation from intent.

Most real systems need both: the output check from a domain expert, the trajectory eval from a dedicated engineer. Process reward model survey: arxiv.org/abs/2510.08049.
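A small sketch of why both signals are useful: the same run can pass the output check while the trajectory shows drift. The `Step` type and its hand-set `on_intent` label are assumptions; in practice that label would come from an engineer or a process reward model reading the trace.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    on_intent: bool  # assumed label: did this step serve what the user asked?


def outcome_reward(final_output: str, expected: str) -> float:
    """Sparse: one number for the whole run."""
    return 1.0 if final_output == expected else 0.0


def process_rewards(trajectory: list[Step]) -> list[float]:
    """Dense: one number per step."""
    return [1.0 if step.on_intent else 0.0 for step in trajectory]


trajectory = [
    Step("read spec", True),
    Step("rewrite unrelated module", False),  # drift the output check cannot see
    Step("write fix", True),
]

outcome = outcome_reward("tests pass", "tests pass")
per_step = process_rewards(trajectory)
```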

18 / 25

Specialize a general model into a harness for each environment.

Each one brings its own config and skills, runs, and leaves traces. Those traces improve both the specialized model and the general one.

The general model gains most from diversity, pulling more tasks into its distribution and interpolating the rest.

19 / 25

From my own work · x.com/NirantK

Project Report: Claude Agent SDK with Skills, for Landing Page Generation

Before we swap models, one example of my own. Not an elaborate harness. One small loop, run well: generate, then check. That alone carries the result, which sets up the next slide. If the loop carries the quality, how much does the model underneath still matter?

20 / 25

Five mobile landing pages built by Sonnet 4.6, Kimi 2.6, GLM-5, GPT-5 and Minimax-2.7 from one harness with per-model skills

One harness. Five models, each skill-tuned to its strengths. All five ship.

Swap the model and the output holds, because the reward signal and the harness carry the quality. The model is the commodity.
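The claim can be sketched as a generate-then-check loop in which the model is just an argument. This is not the harness from the project report: `check`, `harness`, and the two stand-in models are hypothetical, and the reward (a headline plus a call to action) is an invented example.

```python
from typing import Callable, Optional

Model = Callable[[str], str]


def check(page: str) -> bool:
    # Assumed reward: a landing page needs a headline and a call to action.
    return "<h1>" in page and "<button>" in page


def harness(model: Model, brief: str, max_tries: int = 3) -> Optional[str]:
    """Generate, check, and retry with feedback; the model is swappable."""
    prompt = brief
    for _ in range(max_tries):
        page = model(prompt)
        if check(page):
            return page
        prompt = brief + "\nInclude an <h1> headline and a <button> call to action."
    return None


def model_a(prompt: str) -> str:
    # Stand-in for a stronger model: gets it right the first time.
    return "<h1>Tea</h1><button>Buy</button>"


def model_b(prompt: str) -> str:
    # Stand-in for a weaker model: adds the button only when told to.
    if "<button>" in prompt:
        return "<h1>Tea</h1><button>Order</button>"
    return "<h1>Tea</h1>"


results = {
    name: harness(model, "Landing page for a tea shop")
    for name, model in {"model_a": model_a, "model_b": model_b}.items()
}
```

Both stand-ins end up passing the same check; the weaker one simply needs one more trip around the loop.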

21 / 25

The test ahead

A company should be able to switch out a 'generalist' model without losing the 'company veteran' expertise built into their learning system. This is the key 'test' of your control and sovereignty in the era ahead.

Satya Nadella, Chairman & CEO, Microsoft · x.com/satyanadella/status/2066182223213293753

22 / 25

Keep the loop running.

Auto-harness, meta-harness, auto-research, whatever its name. Make it run on problems worth solving, often, and correctly. Then define and measure everything else users expect.

We spent decades building software on short text and clicks. Now we can process language. Great power, and the responsibility to nail it.

23 / 25

A Claude interface at 90 percent of the session limit, with the prompt: create a summary so that GPT can understand it clearly

01

Agents compile human input.

The compiler framing tells you where the real work is.

02

The loops that matter sit outside the agent.

Conversion for SaaS, the merge loop for coding. That is loopcraft.

03

Reward makes the model interchangeable.

Find the reward one loop up and feed it back. Swap the model, the output holds.

24 / 25

Thank you

Own the loop.

Reward is all you need.

Nirant Kasliwal · Scaled Focus · nirantk.com · @nirantk

25 / 25