Agent Harness: The Missing Middle Between Model Reasoning and Real Execution

The first time you build an agent, the problem looks deceptively simple. You have a model. You have a prompt. You add a tool or two. The model calls the tool. The tool returns a result. The model responds. It feels like the hard part is done. Then the agent has to do real work.

It needs to keep track of a multi-step task. It needs to ask before touching something risky. It needs to remember what it already inspected. It needs to read and write files without trampling the working folder. It needs to compact context before the model falls off a token cliff. It needs to expose progress to the user. It needs telemetry. It needs recovery points. It needs a way to continue until the work is actually finished, not merely until the model produced a confident paragraph.

That is where most agent demos quietly become hand-rolled infrastructure. And that is why Agent Harness is interesting.

Not because it gives you a smarter model. It does not. Not because it replaces Microsoft Foundry. It does not. It is interesting because it names and packages the part many teams end up rebuilding badly: the operating layer between model reasoning and real execution.

Microsoft has also started an official Build your own CLAW series with a hands-on first sample. Treat that as the tutorial path. This post is the architecture map where Agent Harness sits, how it relates to Foundry and when I would reach for it.

The missing middle

The simplest model would be:

Sketchnote showing model capability, Agent Harness, and real systems connected through intent, actions, results and traces

On the left, you have model capability. In the Microsoft ecosystem, that might be a model reached through a Microsoft Foundry project endpoint. It could also be another chat client supported by Microsoft Agent Framework.

On the right, you have reality: files, APIs, shells, web search, MCP tools, code execution, reports, approvals, external systems and workflow side effects. The gap in the middle is where agents become trustworthy or dangerous.

If you do not put a deliberate runtime shape there, you end up with glue code. Some function calling here, some memory there, an approval callback nobody can quite explain, a scratch folder, a retry loop, a few logs, and a growing system prompt trying to make the model behave like an operations engineer.

Agent Harness is an attempt to make that middle explicit.

What Agent Harness is

At the time of writing this post, within the current Agent Framework repo, the .NET surface is HarnessAgent. The Python surface is create_harness_agent. The exact APIs are still marked experimental but the direction is clear enough to pay attention to.

The .NET implementation describes HarnessAgent as a pre-configured agent that wraps a ChatClientAgent with function invocation, per-service-call chat history persistence, optional compaction, default context providers and decorators. The Python sample describes create_harness_agent as a factory that builds a pre-configured, batteries-included agent pipeline from a chat client.

That wording matters. Harness is not the model. Harness is not Foundry. Harness is the agent operating scaffold around a model client. At a practical level, it can bring together capabilities such as this:

Sketchnote capability map showing Agent Harness grouping function invocation, tool-call loops, history persistence, context compaction, todos, plan and execute modes, skills, background agents, file-backed memory, controlled working directories, tool approval, auto-approval rules, telemetry and loop evaluators

You can disable or customise these pieces. That is the important part. This is not a black box agent product. It is an opinionated assembly of Agent Framework primitives.

Why this is not just Foundry

This is where the naming can get confusing.

Microsoft Foundry Agent Service is the managed platform layer. It gives you a place to build, deploy, scale and govern agents. It owns cloud concerns such as model access, hosted agents, tool configuration, identity, deployment, evaluation and enterprise controls.

Microsoft Agent Framework is the SDK and application layer. It gives you code-first primitives for agents and workflows such as sessions, tools, middleware, context providers, orchestration, streaming and provider integrations.

Agent Harness belongs to the Agent Framework side of that line.

Foundry can still be central to the story. In fact, many of the current Harness samples are Foundry-backed. A harness agent can use a Foundry model endpoint through FoundryChatClient in Python or AIProjectClient / IChatClient style integration in .NET. A harness-backed application can also be hosted in Foundry as a hosted agent. But the relationship is not "Foundry has Agent Harness" in the same way Foundry has hosted agents or toolboxes. The more accurate framing is:

Agent Harness is an Agent Framework pattern that can run against Foundry models and can fit into Foundry-hosted agent architectures.

That distinction saves you from a bad design decision later.

The Foundry split you need to understand

The Foundry provider docs make an important distinction between two patterns.

The first pattern is direct inference. Your application owns the agent definition. It supplies the model, instructions, tools and runtime behavior in code. This is where Harness fits most naturally because the app owns the loop.

The second pattern is a service-managed Foundry agent. The agent definition lives in Foundry. Its instructions and hosted tools are configured in the Foundry portal or through service APIs. Your application connects to that agent and runs it.

With a service-managed FoundryAgent, you cannot assume that client-side tool changes behave the same way as an app-owned agent. The docs are explicit that the Foundry agent definition is authoritative. Hosted tools must be configured on the Foundry agent. Runtime instructions and tools passed from the client are constrained by that server-side definition. That does not make one pattern better than the other. It means they solve different problems.

Use app-owned Harness when you want the application to compose tools, memory, skills, approval, file access and iteration at runtime. Use service-managed Foundry agents when the enterprise needs versioned agent definitions, portal-managed tool configuration and stronger separation between the app shell and the agent definition. Use Foundry hosted agents when you want to run your Agent Framework code as a managed deployed agent.

Why the capability matters

Agent Harness matters because production agents usually fail in the boring middle. They do not fail only because the model is weak. They fail because the system has no clear answer to questions like these:

Sketchnote showing runtime questions production agents need answered: open work, approval, inspection, output locations, long tool loops, context bloat and observability

You can solve all of those yourself. Sometimes you should. But if every agent project rebuilds those mechanics differently, the organisation ends up with ten incompatible agent runtimes hiding behind ten chat boxes.

Harness gives teams a common starting shape. That is especially useful for the class of agents that sit between conversation and work: procurement analysts, contract review assistants, sales operations copilots, customer-feedback analysts and internal productivity agents. These systems do not just answer. They inspect, plan, call tools, update files, ask for permission, recover from partial state and produce an artifact. That is not a prompt. That is a runtime.

It makes agent behavior visible (and that's my favourite thing)

The strongest signal in Harness is not one feature. It is the combination of planning, state, approvals and telemetry. A model that calls a tool is easy to demo. A model that calls the right tool, explains why, records what happened, asks before crossing a boundary, keeps a todo list, survives compaction and then exposes enough trace data for a developer to debug the run is much closer to something you can put in front of a team.

This is the same lesson I keep running into with agent orchestration and skills. The model's intelligence matters but the system's shape matters more.

With Harness, the shape says this:

Sketchnote showing the shape Agent Harness gives an agent: task state outside private reasoning, controlled tools, first-class approval, bounded long-running work, visible progress and traces, and explicit memory and file access

That is a better default than a single system prompt trying to convince the model to be careful.

When to use it

I would reach for Harness when the agent has to perform open-ended work across multiple steps and tools.

For example, a procurement agent that has to compare vendor quotes, track assumptions and write a recommendation memo. A contract review assistant that has to extract obligations, flag risky clauses and ask before producing a client-facing summary. A sales operations copilot that has to inspect CRM exports, find stalled opportunities and draft a follow-up plan for approval.

These are all cases where the risk is not only "will the answer be good?" The risk is "can the agent keep its work organised while touching real systems?"

Harness is useful when the second question matters.

When not to use it

I would not use Harness just because the word "agent" appears in a backlog item.

If the use case is a single model call with light tool use, a normal Agent Framework agent is probably enough. If the process is deterministic and the steps are well known, an Agent Framework workflow may be a better fit. If the agent definition must be centrally managed in Foundry with fixed tools and instructions, a service-managed Foundry agent may be the right anchor.

Harness is also still an emerging surface. The recent Agent Framework changes around file access, tool approval, shell support, loop behavior and Python/.NET alignment show active movement. That is a good signal for momentum but it also means you should design with version churn in mind.

Use it because the runtime shape helps your application not because it is the newest trend in the mix.

An easy sample to understand it all

This time, I would build something deliberately not flashy.

I would create a Foundry-backed procurement quote comparison harness. Give it a controlled working folder with three vendor quotes, a requirements sheet and a preferred scoring rubric. The agent can list the available files, read each quote, extract price, delivery date, payment terms, warranty, exceptions and missing information, then produce a recommendation memo.

Sketchnote showing a procurement quote comparison sample where Agent Harness reads controlled input files, uses plan mode, todo state, execute mode and approval gates, then writes an approved recommendation memo with traces

Why this sample?

Because procurement is a real business workflow, but the sample does not need private customer data or a complex app shell to make the point. You can see file access, task state, tool calls, comparison logic, controlled output and observability in one place.

The approval boundary is also natural. Reading quotes is safe. Writing the final recommendation, marking a preferred vendor or drafting the supplier follow-up should require approval. Telemetry should show which files were read, which criteria were applied, which assumptions were made and where the final memo was written.

The point would not be "look, the model can compare quotes." The point would be "look, the agent can safely move from user intent to a real business artifact while the runtime remains inspectable." That is the missing middle.

I think Agent Harness is useful. More importantly, I think it points in the right direction.

The first phase of agent development was mostly about capability i.e. can the model call tools, reason over context and produce useful output? The next phase is about operating discipline; can the agent work across steps, permissions, memory, files, tools, telemetry and recovery without each team inventing its own runtime?

Agent Harness is not the entire answer. It will not replace careful architecture, threat modeling, evaluation or domain-specific workflows. But it is a strong default for a class of agents that are too complex for a simple chat loop and too open-ended for a rigid workflow.

That is the space where most real agent work lives. Between model reasoning and real execution. Right in the middle.

Until next time.