The Architecture of Tool-Use in Agent Systems

Tool-use is the mechanism that transforms an LLM from a chatbot into an agent. But the execution loop — schema definition, tool selection, invocation, error recovery, result integration — is surprisingly complex in production. This post maps the architecture of tool-use in modern agent systems, drawing on the latest production patterns from Anthropic, the MCP ecosystem, and empirical research.

The Four-Phase Tool-Use Cycle

Every tool interaction follows the same four-phase cycle [4]:

Define — Declare available tools with structured schemas (name, description, input parameters) [4]
Select — The LLM chooses a tool and fills parameter values [4]
Invoke — The harness executes the tool and captures the result [4]
Integrate — The result is injected back into the conversation context [4]

This sounds simple, but each phase has failure modes that compound [1][2]. A poorly-written description causes wrong tool selection. A missing parameter format causes invocation errors. A massive result pollutes context [4]. Production systems need safeguards at every phase [4].

Phase 1: Define — The Schema Is a Prompt

Tool definitions are dual-nature artifacts: they serve as both a specification (what the tool does) and a prompt (how the model should reason about using it) [2]. JSON Schema defines structure but cannot express usage patterns — parameter correlations, conventions, or when a tool should be used vs. avoided [2][6].

A large-scale study of 856 tools across 103 MCP servers found that 97.1% of tool descriptions contain at least one “smell” — recurring suboptimal patterns that degrade clarity [2]. The most common: 56% of tools fail to state their purpose clearly [2].

The Six Components of a Good Tool Description

The study identified six essential components [2]:

Component	Function
Purpose	What the tool does
Guidelines	When/how to use it (activation criteria + operational instructions)
Limitations	Constraints and caveats
Parameter Explanation	Meaning and intent of each parameter
Length & Completeness	Adequate detail (≥3-4 sentences for complex tools)
Examples	Illustrative usage

Augmenting descriptions to include all six components improved task success rates by a median of 5.85 percentage points and partial goal completion by 15.12% — but at a cost: 67.46% more execution steps and regressions in 16.67% of cases [2]. The fix creates a new problem: token bloat [1].

Phase 2: Select — The Tool Search Problem

The traditional approach loads all tool definitions upfront. This breaks at scale [1]. Anthropic reports that a typical setup with 5 MCP servers (58 tools) consumes ~55K tokens just in tool definitions [1]. Adding Jira pushes it to ~72K tokens before conversation starts [1]. Their internal deployment hit 134K tokens in tool definitions alone [1].

Deferred Loading

Anthropic’s solution is Tool Search — a meta-tool that defers loading most tools. Only a search tool (~500 tokens [1]) plus a few critical tools are loaded upfront. When the LLM needs a specific tool, it searches the registry and loads only relevant matches (3-5 tools ≈ 3K tokens) [1].

Key numbers [1]:

Total context consumption: ~77K → ~8.7K tokens (85% reduction) [1]
Context window preserved: 95% [1]
Accuracy on MCP evals (Opus 4.5): 79.5% → 88.1% (+8.6pp) [1]

Dynamic Tool Retrieval

For systems with 50+ tools, embedding-based retrieval selects the top-k relevant tools at each turn [4]. This reduces noise in the selection set — fewer tools means fewer wrong choices. The tradeoff: retrieval adds latency and requires a separate embedding/indexing pipeline [4].

Phase 3: Invoke — Execution Patterns

Tool execution has distinct production patterns:

Synchronous (Default)

The harness invokes the tool, blocks for the result, and returns it to the conversation. Simple, reliable, but serial — 20 tool calls require 20 full inference passes [1].

Programmatic (Code-Mediated)

Anthropic’s Programmatic Tool Calling (PTC) lets the LLM write Python code that orchestrates tool calls in a sandboxed environment [1]. Instead of 20 model round-trips, the model writes one script that calls tools programmatically. Only the final result enters context [1].

The impact on complex research tasks [1]:

37% token reduction (43,588 → 27,297 tokens) [1]
Knowledge retrieval accuracy: 25.6% → 28.5% [1]
Eliminates 19+ inference passes for 20 tool calls [1]

Parallel (Fan-Out)

For independent tool calls, the orchestrator-worker pattern [4] dispatches tools in parallel using Send-style fan-out. Wall-clock time drops from linear to the slowest single call, but cost multiplies with concurrency [4].

Phase 4: Integrate — Result Management

The result integration phase is where production systems most commonly degrade. Large tool results (10MB log files, thousands of database rows) pollute the context window and push out earlier reasoning [1]. Two strategies:

Summarization with PTC: Scripts process results in the code execution environment and return only aggregated output (e.g., “Total: $12,450 — 3 items over budget” instead of 2000 expense line items) [1].

Structured result schemas: Define result types per tool so the harness can validate and truncate. A file-read tool returns a chunk, not the whole file. A search tool returns page 1, not all matches [6].

MCP: The Standardization Layer

The Model Context Protocol [3] is the de facto standard for tool interoperability [5]. As of March 2026, MCP has 97 million monthly SDK downloads, over 81,000 GitHub stars, and support from every major AI vendor — Anthropic, OpenAI, Google, Microsoft, AWS [3][5].

Current Architecture

MCP uses a client-server architecture that separates AI agents from tool implementations. The client discovers tools via tools/list and invokes them via tools/call [3]. Tool descriptions (name, description, input schema in JSON Schema) are the primary contract [2].

2026 Roadmap Priorities

The MCP roadmap [3] identifies four priority areas for 2026 — all relevant to production tool-use:

Transport evolution and scalability — Stateful sessions conflict with load balancers; the protocol needs stateless, horizontally-scalable transport [3].
Agent communication — The Tasks primitive (SEP-1686) shipped experimentally; production use identified gaps in retry semantics and result expiry [3].
Governance maturation — A contributor ladder and delegated review process to unblock protocol evolution [3].
Enterprise readiness — Audit trails, SSO auth, gateway behavior, configuration portability [3].

Putting It Together: A Production Tool-Use Stack

A production-grade tool-use harness combines all these patterns:

┌─────────────────────────────────────┐
│          Tool Registry              │
│  (descriptions + metadata + index)  │
├─────────────────────────────────────┤
│         Tool Selection              │
│  Static (few tools) / Dynamic (many) │
├─────────────────────────────────────┤
│         Tool Execution              │
│  Sync / Programmatic / Parallel     │
├─────────────────────────────────────┤
│         Result Management           │
│  Summarize / Validate / Truncate   │
├─────────────────────────────────────┤
│         Protocol Layer (MCP)        │
│  Discovery + Invocation + Auth      │
└─────────────────────────────────────┘

Each layer has tuned configurations per model: Claude handles deferred loading natively; GPT systems need custom retrieval [1][4]. Different toolsets per model, different prompts, different execution modes [1][4].

The Takeaway

Tool-use is no longer a feature — it’s a first-class architectural concern in agent systems. The model selects the tool, but the harness defines which tools exist, how they’re described, how they execute, and how their results are processed [1]. As MCP standardizes the protocol layer, the differentiation shifts to execution patterns — deferred loading, programmatic orchestration, parallel fan-out, and intelligent result management [1][4].

The systems that get this right will have agents that use tools faster, more accurately, and with less context pollution. The systems that don’t will burn tokens on wrong tool selections and bloated contexts [1][2].

References

[1] Anthropic, “Introducing advanced tool use on the Claude Developer Platform,” Nov 24, 2025. https://www.anthropic.com/engineering/advanced-tool-use

[2] M. M. Hasan et al., “Model Context Protocol (MCP) Tool Descriptions Are Smelly! Towards Improving AI Agent Efficiency with Augmented MCP Tool Descriptions,” arXiv:2602.14878, Feb 2026. https://arxiv.org/html/2602.14878v1

[3] “The 2026 MCP Roadmap,” Model Context Protocol Blog. https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/

[4] SitePoint, “Agentic Design Patterns: The 2026 Guide to Building Autonomous Systems,” Mar 2, 2026. https://www.sitepoint.com/the-definitive-guide-to-agentic-design-patterns-in-2026/

[5] “Model Context Protocol,” Wikipedia. https://en.wikipedia.org/wiki/Model_Context_Protocol

[6] Anthropic, “Tool use with Claude,” Claude API Docs. https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview