The Three-Layer Architecture of Production Agent Harnesses

The agent harness has quietly become the most consequential piece of AI infrastructure most teams never think about. A May 2026 study from MBZUAI’s VILA Lab dissected Claude Code’s source tree (1,884 files, ~512K lines) and found that roughly 98.4% of a production agent’s codebase is harness infrastructure — permissions, context management, sandboxing, tool routing, recovery — and only about 1.6% is AI decision logic [1]. The model is the thin end of the wedge. Everything below it is engineering.

But even that 98.4% is not monolithic. In 2026, the agent stack has stratified into three distinct layers, each with its own failure modes, design tradeoffs, and tooling ecosystem. This post maps that architecture — the framework, the harness, and the platform — and examines what each layer buys you, and what it costs.

The Three-Layer Stack

Cloudflare’s June 2026 launch of the Flue framework made this stratification explicit [2]. The stack, from top to bottom:

Framework (Flue, CrewAI, AutoGen) — project structure, conventions, integrations, CLI, developer experience
Harness (Pi, Codex, Claude Code, MAF Harness) — the agentic loop, tool calling, context management, state recovery
Platform (Cloudflare Agents SDK, Foundry, Kubernetes) — compute, state, storage, sandboxing primitives

Each layer solves a different category of problem. Confusing them — or collapsing them — is where production agents break.

Framework Layer: What the Agent Knows

The framework is the developer-facing abstraction. It defines the mental model: do you script what the agent does (imperative), or describe what it knows (declarative)?

Flue takes the declarative approach. You define context — model, skills, sandbox, instructions — and the agent solves whatever task you give it autonomously [2]. There’s no orchestration loop to write. A triage agent that intercepts a bug report, reproduces it in a sandbox, and diagnoses the issue fits in ~25 lines.

CrewAI takes a role-based declarative model: agents are specialists with roles, goals, and backstories, coordinated into crews [3]. This maps naturally to content creation, research pipelines, and human-adjacent workflows. Its strength is rapid prototyping — minimal boilerplate, hundreds of built-in tool integrations.

LangGraph is the imperative counterpoint. It treats agent workflows as explicit state machines: nodes are reasoning or tool-use steps, edges define conditional transitions, and state is managed via TypedDict [3]. The control precision is unmatched, but the code volume is higher and the learning curve steeper.

The framework layer is where most teams start — and where most get stuck, because they try to solve harness-level problems with framework-level abstractions.

Harness Layer: The Agentic Loop

The harness owns the core loop: call the model, parse tool requests, execute tools, feed results back, repeat until done. This is where the 98.4% lives.

Microsoft’s Agent Framework (MAF), which reached 1.0 GA on April 2, 2026 as the convergence of AutoGen and Semantic Kernel, ships harness extensions that turn any chat client into a production-grade agent loop [4]. The harness provides:

Automatic context compaction — monitors token usage and compresses chat history mid-loop to prevent context window overflow during long tool-calling chains
Instruction merging — harness-level system instructions appear first, then custom agent instructions
Built-in providers — file memory (session-scoped persistence), file access, todo/work tracking, mode switching (plan vs. execute), skill discovery, background sub-agents for fan-out
Tool approval middleware — “don’t ask again” rules for sensitive operations

These features are not optional niceties. They are the difference between a demo that runs for three turns and a production agent that survives 300. Without automatic context compaction, every long tool-calling chain is a context window overflow waiting to happen. Without instruction merging, agents drift from their core operating constraints as prompt hacking accumulates across turns.

The MBZUAI study found that Claude Code’s harness handles roughly 60 distinct responsibilities, from shell command execution and file editing to permission checks and sub-agent spawning [1]. Each responsibility is a surface for failure. The harness is the layer that contains those failures — or doesn’t.

Platform Layer: Where the Harness Runs

A harness can’t solve distributed systems problems on its own. When an agent is interrupted mid-turn, who holds its state? When it executes untrusted code, who provides the sandbox? When a host crashes, who recovers the fiber?

These questions belong to the platform layer.

Cloudflare’s Agents SDK provides three primitives that Flue maps onto for its Cloudflare target [2]:

runFiber() — records checkpoint progress to Durable Object SQLite storage before work starts
stash() — snapshots intermediate state as the turn advances
onFiberRecovered() — delivers the last checkpoint to a fresh instance after interruption

Each event in the execution history is written to an append-only log. If a process dies, another picks up the log and continues from the exact step it left off. This is durable execution for agents — the same pattern that Temporal and Azure Durable Functions provide for workflows, but adapted for the specific failure modes of LLM-driven loops.

Microsoft’s Foundry Hosted Agents takes a different approach: per-session VM isolation with persistent filesystem state, scale-to-zero, and automatic resume [4]. Every session gets its own VM-isolated sandbox. Scale-to-zero means you pay nothing while the agent is idle; incoming requests trigger a cold start that restores filesystem, disk state, and session identity.

The platform layer is where the industry is investing most heavily in 2026. Google’s A2A protocol (now a Linux Foundation project with 150+ supporters) and Anthropic’s MCP (Model Context Protocol) are both platform-level standards [3][5]. A2A handles agent-to-agent communication; MCP standardizes how agents exchange context with tools and external systems. Together they define the interoperability contracts that the platform layer must implement.

The Architectural Gradient

The three-layer stack is not rigid — there’s a gradient from prototype to production:

Stage	Framework	Harness	Platform
Prototype	CrewAI or Flue (minimal config)	Default harness	Local process
Staging	LangGraph (explicit state)	MAF Harness with tools	Container with recovery
Production	Custom graph with audit	Full harness (compaction, approval, checkpointing)	Durable execution (Fibers, Foundry, Temporal)

The Zylos Research survey of enterprise copilot spending ($7.2B in 2026) found that 86% goes to agent-based systems, and 40%+ of agentic projects may be cancelled by 2027 due to cost and complexity [3]. The teams that succeed are those that don’t skip layers — they invest in the harness and platform early, rather than retrofitting them after the prototype breaks.

Key Design Decisions for Each Layer

Framework decisions

Declarative vs. imperative — CrewAI/Flue (declarative) for rapid iteration; LangGraph (imperative/state-machine) for compliance-critical workflows
Human-in-loop spectrum — Deloitte identifies three models: in the loop (approve each step), on the loop (supervise), out of the loop (continuous monitoring) [6]. Most enterprises are converging on “on the loop”
Interoperability — Adopt A2A and MCP before you need them. Retrofitting is expensive

Harness decisions

Context window strategy — without automatic compaction, tool-calling chains are bounded by the model’s context window. This is the single most common production failure
Tool approval granularity — per-call, per-session, or rule-based. Microsoft’s “don’t ask again” pattern is a good starting point [4]
Provider architecture — file memory, skill discovery, and background agent execution should be pluggable, not hard-coded

Platform decisions

State durability — in-memory is fine for demos. Production needs checkpointed, durable state with recovery
Sandboxing — MicroVM per tool call (Hyperlight [4]), container per session (Foundry), or Durable Object per agent (Cloudflare). The isolation boundary determines your blast radius
Observability — OpenTelemetry Semantic Conventions for agents are emerging [4]. Wire them in from day one

The 1.6% Problem

The MBZUAI finding that ~1.6% of a production agent is AI logic is not a criticism — it’s a design constraint [1]. If 98.4% of your agent is infrastructure, then improvements to the model (faster inference, better reasoning) can at most affect 1.6% of the codebase. The other 98.4% needs its own improvement cycle: faster tool execution, better context compaction algorithms, more reliable checkpointing, tighter sandboxing.

This reframes the agent performance problem. When an agent is slow, the bottleneck is rarely the model — it’s the harness overhead of accumulating context across tool calls, or the platform overhead of spawning sandboxes. Optimizing prompt templates while ignoring harness latency is optimizing the 1.6%.

References

[1] VILA Lab, MBZUAI / UCL. “Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems.” arXiv:2604.14228, April 2026. https://arxiv.org/abs/2604.14228

[2] Thomas Gauvin, Cloudflare. “Bringing more agent harnesses and frameworks to Cloudflare, starting with Flue.” June 2026. https://blog.cloudflare.com/agents-platform-flue-sdk/

[3] Zylos Research. “AI Agent Orchestration Frameworks: LangGraph, CrewAI, AutoGen Comparison (2026).” January 2026. https://zylos.ai/research/2026-01-12-ai-agent-orchestration-frameworks/

[4] Shawn Henry, Microsoft Agent Framework Team. “Microsoft Agent Framework at BUILD 2026: Agent Harness, Hosted Agents, CodeAct, and more.” June 2026. https://devblogs.microsoft.com/agent-framework/microsoft-agent-framework-at-build-2026-announce/

[5] MLflow. “Building Production-Ready AI Agents in 2026.” May 2026. https://mlflow.org/articles/building-production-ready-ai-agents-in-2026/

[6] Deloitte. “Unlocking exponential value with AI agent orchestration.” 2026. Referenced in Zylos Research [3].

NiteAgent — AI agent development, frameworks, and production patterns
ToolBrain — tool reviews, LLM comparisons, and AI workflow guides

Cross-links automatically generated from CodeIntel Log.