Tools

Intro

Tools are the interface between an LLM's reasoning and the external world — they let the model read data, perform computations, and trigger side effects that text generation alone cannot accomplish. In agentic systems, tools determine what the agent can actually do: without well-designed tools, even a strong model with perfect reasoning produces useless output. Anthropic's SWE-bench agent team found that tool quality had more impact on task success than prompt quality — switching from relative to absolute file paths in one tool eliminated an entire class of failures.

The mechanism is function calling: the model receives JSON schemas describing available tools (name, description, parameters), and when it decides a tool is needed, it emits a structured call instead of text. The runtime executes the function, returns the result, and the model continues reasoning. This cycle repeats inside the agent loop until the model produces a final answer.

```mermaid
sequenceDiagram
    participant U as User
    participant R as Runtime
    participant M as Model
    participant T as Tool
    U->>R: User message
    R->>M: Messages + tool schemas
    M->>R: tool_call name and args
    R->>R: Validate args against schema
    R->>T: Execute function
    T->>R: Result or error
    R->>M: Tool result message
    M->>R: Final text response
    R->>U: Answer
```

The model never executes tools directly — it only predicts which tool to call and what arguments to pass. The runtime handles execution, validation, and error propagation. This separation is a security boundary: the model cannot bypass schema validation or invoke tools not in its provided schema.
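This runtime side of the cycle can be sketched in a few lines. A minimal loop, assuming a `model` callable that returns either a `tool_call` dict or final text, and a `tools` dict mapping names to Python functions (all names here are illustrative, not a real SDK):

```python
import json

def run_agent_loop(model, tools, messages, max_iters=10):
    """Minimal tool-use loop: the model proposes calls, the runtime executes.

    `model(messages)` returns either {"tool_call": {"name": ..., "args": ...}}
    or {"text": ...}; `tools` maps tool names to Python callables.
    """
    for _ in range(max_iters):
        reply = model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["text"]          # final answer: stop looping
        fn = tools.get(call["name"])
        if fn is None:                    # model asked for a tool not in its schema
            result = {"error": "unknown_tool", "message": call["name"]}
        else:
            try:
                result = fn(**call["args"])
            except Exception as exc:      # surface failures as data, never crash the loop
                result = {"error": type(exc).__name__, "message": str(exc)}
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "max iterations reached"
```

Note that the model only ever sees serialized results appended to the message list; execution, validation, and error capture all live in the runtime.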

For how tools are standardized across clients via a shared protocol, see Model Context Protocol. For how the model selects and calls tools within a reasoning loop, see Agent Loop.

Tool Design Principles

Tool design is API design for an LLM consumer. The same principles that make an API easy for a junior developer apply — but with tighter constraints, because the model cannot read source code, ask clarifying questions, or debug at runtime.

Naming. Use specific, self-documenting function names that signal exactly what the tool does. search_company_directory beats search. get_weather_forecast beats get_data. The model picks tools by matching names and descriptions to its current subgoal — ambiguous names cause wrong tool selection.

Descriptions. Write tool descriptions for the model, not for humans. Include: what the tool does, when to use it, what it returns, and when not to use it. Anthropic recommends including boundary conditions: "Use this to search for employees by name or department. Do not use this for contractor lookup — use search_contractor_database instead."

Parameters. Keep schemas flat and simple. Nested objects degrade argument accuracy. Use enums to constrain values where possible — {"type": "string", "enum": ["celsius", "fahrenheit"]} prevents the model from inventing units. Mark required versus optional fields explicitly. Every parameter needs a description that explains both format and purpose.
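A flat schema following these guidelines might look like the hypothetical `get_weather_forecast` below, with a small enum check the runtime can run before execution (both the schema and the helper are illustrative):

```python
# Hypothetical flat tool schema: no nesting, an enum-constrained unit,
# explicit required fields, and a description on every parameter.
GET_WEATHER_SCHEMA = {
    "name": "get_weather_forecast",
    "description": "Get the weather forecast for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {
                "type": "string",
                "description": "City name, e.g. 'Seattle' or 'London'",
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit for returned values",
            },
        },
        "required": ["city"],
    },
}

def check_enum_args(schema, args):
    """Reject model-invented values that fall outside a parameter's enum."""
    errors = []
    for name, spec in schema["parameters"]["properties"].items():
        if name in args and "enum" in spec and args[name] not in spec["enum"]:
            errors.append({"param": name, "allowed": spec["enum"], "got": args[name]})
    return errors
```

The enum turns "the model invented kelvin" from a downstream API failure into a structured validation error the model can fix on the next iteration.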

Return values. Return only the fields the model needs for its next reasoning step. Returning a full database row when the model only needs one field wastes context tokens and dilutes attention. Structure returns consistently across tools — if all tools return {"result": ..., "error": ...}, the model learns the pattern quickly.

Error messages as teaching signals. When a tool call fails, return a structured error that tells the model what went wrong and how to fix it: {"error": "invalid_date_format", "message": "Expected YYYY-MM-DD, got '12/25/2024'", "hint": "Reformat as 2024-12-25"}. The model can self-correct on the next loop iteration if the error is specific. Silent failures or generic "internal error" messages leave the model stuck.
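The date-format example above can be wrapped as a small validator that returns either a value or a structured error payload (a sketch; the function name and error shape are assumptions, not a standard):

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def validate_date(raw):
    """Return (value, None) on success or (None, structured_error) on failure.

    The error payload names the problem and suggests a fix, so the model can
    self-correct on the next loop iteration instead of getting stuck.
    """
    if DATE_RE.match(raw):
        return raw, None
    return None, {
        "error": "invalid_date_format",
        "message": f"Expected YYYY-MM-DD, got '{raw}'",
        "hint": "Reformat the date as e.g. 2024-12-25 and retry",
    }
```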

In Semantic Kernel (.NET), a well-designed tool looks like this:

```csharp
public class WeatherPlugin
{
    [KernelFunction, Description(
        "Get current weather for a city. Returns temperature, conditions, " +
        "and humidity. Use when the user asks about weather or outdoor plans. " +
        "Do not use for historical weather data.")]
    public async Task<WeatherResult> GetCurrentWeather(
        [Description("City name, e.g. 'Seattle' or 'London'")] string city,
        [Description("Temperature unit")] TemperatureUnit unit = TemperatureUnit.Celsius)
    {
        // Validate input, call weather API, return compact result
    }
}

public enum TemperatureUnit { Celsius, Fahrenheit }
```

The Description attributes become the tool schema the model reads to decide whether and how to call this function. Time invested in these descriptions often pays greater dividends than further prompt engineering.

Versatility

A versatile tool handles varied inputs gracefully rather than failing on anything outside the happy path. In agentic systems, the model generates inputs — you cannot predict the exact format or phrasing it will use. Design tools to accept reasonable variations and normalize internally.

Concrete patterns:

- Accept multiple input formats and convert to a canonical form internally (for example, several date spellings normalized to YYYY-MM-DD).
- Match enum values case-insensitively and trim surrounding whitespace before validating.
- Map common synonyms and abbreviations to canonical values instead of rejecting them.
- When input is ambiguous, make a documented best-effort interpretation and report the normalization in the result so the model can correct course.

The principle: the more rigid a tool's interface, the more likely the model misuses it. Each failure costs a loop iteration, tokens, latency, and sometimes a cascading series of wrong decisions. Validation should correct, not just reject.
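As a concrete sketch of "validation should correct, not just reject": a normalizer for the temperature-unit enum from earlier that accepts the variations a model actually generates (the alias table is illustrative):

```python
def normalize_unit(raw):
    """Map model-generated unit strings to a canonical enum value.

    Accepts common variations ('F', 'Fahrenheit', ' celsius ') instead of
    failing the call; returns None only when the input is unrecognizable,
    at which point the tool should return a structured error.
    """
    aliases = {
        "c": "celsius", "celsius": "celsius", "centigrade": "celsius",
        "f": "fahrenheit", "fahrenheit": "fahrenheit",
    }
    return aliases.get(raw.strip().lower())
```

A strict validator would burn a loop iteration on " F "; the normalizer answers on the first try.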

Fault Tolerance

Tools in agentic systems run inside a loop where failures compound — one failed tool call can derail an entire multi-step plan. Design tools to degrade gracefully, never silently.

Structured error returns. Every tool should have a consistent error contract. The model cannot handle exceptions — it only sees the serialized return value. Return typed errors with actionable context: what failed, why, and what the model should do differently.

Retry-safe design. If a tool call might be retried (network timeout, transient failure), the tool must be idempotent — calling it twice with the same arguments should not create duplicate records, send duplicate emails, or charge twice. Use idempotency keys for state-mutating operations.

Timeout handling. Long-running tools (API calls, database queries) need timeouts with partial result support. Rather than hanging indefinitely, return what you have: {"status": "partial", "results": [...], "message": "Query timed out after 5s, returning first 50 results"}. The model can decide whether to proceed with partial data or retry.
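A sketch of the partial-result pattern: collect pages until the time budget runs out, then return whatever accumulated with a "partial" status (the pagination callback is an assumption for illustration):

```python
import time

def search_with_timeout(fetch_page, timeout_s=5.0, max_pages=100):
    """Collect results until the time budget runs out, then return what we have.

    `fetch_page(i)` is any callable returning a list of results for page i.
    The model sees an explicit status field and can decide whether partial
    data is enough or a narrower retry is needed.
    """
    deadline = time.monotonic() + timeout_s
    results = []
    for page in range(max_pages):
        if time.monotonic() >= deadline:
            return {
                "status": "partial",
                "results": results,
                "message": f"Timed out after {timeout_s}s, returning {len(results)} results",
            }
        results.extend(fetch_page(page))
    return {"status": "complete", "results": results}
```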

Input validation before execution. Validate tool arguments before performing any side effect. A tool that sends an email should validate the address format before making the API call, not after. Return validation errors as structured feedback so the model can fix and retry.

Caching

Caching reduces latency, cost, and token waste in agent loops. There are two layers to consider:

Tool result caching. When the same tool is called with the same arguments within a session, cache the result instead of re-executing. This is especially valuable for read-only tools (database lookups, search queries, API fetches) that the model calls repeatedly. A common production pattern: hash the function name + arguments as a cache key with a short TTL (30s–5min depending on data freshness requirements). In the agent loop, this prevents the infinite-loop pitfall where the model calls the same search repeatedly without making progress.
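The hash-the-name-plus-arguments pattern is a few lines; a minimal sketch with an in-memory store and TTL (the wrapper and store are illustrative, not a specific library):

```python
import hashlib
import json
import time

_cache = {}  # key -> (timestamp, result); stand-in for a real cache

def cached_call(fn, name, args, ttl_s=60.0, now=time.monotonic):
    """Memoize a read-only tool call on a hash of its name and arguments.

    json.dumps(..., sort_keys=True) makes the key stable regardless of the
    order the model emits arguments in; entries expire after ttl_s seconds.
    """
    key = hashlib.sha256(
        json.dumps({"name": name, "args": args}, sort_keys=True).encode()
    ).hexdigest()
    hit = _cache.get(key)
    if hit is not None and now() - hit[0] < ttl_s:
        return hit[1]                     # fresh hit: skip re-execution
    result = fn(**args)
    _cache[key] = (now(), result)
    return result
```

Only wrap read-only tools this way; the staleness caveat below explains why caching mutations is dangerous.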

Prompt caching for tool schemas. When tool schemas are large (many tools, detailed descriptions), they consume significant input tokens on every request. Anthropic's prompt caching feature caches the system prompt and tool definitions across requests, reducing input token cost by up to 90% and latency by up to 85% for subsequent calls. OpenAI's API also supports automatic prompt caching for tool definitions that remain stable across requests.

Staleness trade-off. Cache read-only tools aggressively. Caching state-mutating tools is dangerous — if the model calls create_ticket and gets a cached "success" response, no ticket was actually created. Only cache tools whose results are deterministic for the given arguments within the cache TTL.

Pitfalls

Over-Parameterized Tools

A tool with 15 parameters gives the model 15 opportunities to hallucinate an argument. Each optional parameter increases the surface area for errors. Prefer multiple focused tools over one Swiss-army-knife tool. A search_by_name(name) and search_by_department(dept) pair is more reliable than search(name?, dept?, role?, location?, start_date?, ...).

Poor Descriptions That Mislead the Model

Vague descriptions like "Processes data" or "Handles requests" give the model no basis for deciding when to use the tool. The model selects tools by matching descriptions to its current subgoal — if the description does not clearly state what the tool does, when to use it, and what it returns, the model will either skip it when needed or misuse it.

Tools with Hidden Side Effects

A tool named get_user_profile that also logs an analytics event and updates a "last accessed" timestamp has hidden side effects the model cannot reason about. If the model calls it exploratively during planning, the side effects fire unintentionally. Keep read tools read-only. Separate queries from commands — this is CQRS applied to tool design.

Context Degradation from Large Toolsets

Adding more tools does not just cost tokens — it actively degrades accuracy. MCPGauge (Song et al., 2025) tested 6 commercial LLMs with 30 MCP tool suites and measured an average 9.5% accuracy drop when tools were present, with code generation worst-hit at −17%. Token overhead ranged from 3.25× to 236.5× input tokens. A single GitHub MCP server (26 tools) consumes over 4,600 tokens in schema definitions alone; the full MCP ecosystem (2,797 tools) would consume 248K tokens.

Three mechanisms compound:

- Schema overhead: every tool definition consumes input tokens on every request, crowding out task-relevant context.
- Selection confusion: as the toolset grows, tools with similar names and descriptions make wrong tool selection more likely.
- Attention dilution: long schema blocks compete with the user's actual request for the model's attention, degrading reasoning even when the right tool is chosen.

Mitigations:

| Technique | How it works | Best for |
|---|---|---|
| On-demand tool search | Tools register with deferred loading; a search tool retrieves 3–5 relevant definitions per query (Anthropic's native tool_search_tool). 85%+ context reduction. | 50–10,000 tools |
| RAG over tool descriptions | Embed tool descriptions in a vector index; retrieve top-k by semantic similarity per query. You control the retrieval pipeline. | 500+ tools, custom retrieval |
| Middleware filtering | Rule-based layer injects only relevant tools based on conversation state, user role, or stage. Zero retrieval overhead. | 10–50 tools, deterministic routing |
| Tool consolidation | Group related operations under one tool with an action enum (e.g., github_pr with create \| review \| merge). Directly reduces schema count. | Related operations, any scale |
| Two-stage routing | Stage 1 classifies the query into a tool category; Stage 2 shows only that category's tools. Can use a small classifier or the tool search itself. | Multi-domain, any scale |
| Code generation | Replace N tool schemas with a single execute_code tool + API docs. The model writes code that calls your APIs. | Open-ended data/code tasks |
| Structured output routing | Model returns a structured action JSON; your code dispatches. No tool schemas needed. | Fixed action types |
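Tool consolidation is the simplest of these to sketch: one schema with an action enum replaces several. A hypothetical github_pr dispatcher (the handlers are stubs, not real GitHub API calls):

```python
def github_pr(action, **kwargs):
    """One consolidated tool replacing separate create/review/merge schemas.

    The schema constrains `action` to an enum; dispatch happens here.
    Handlers are illustrative stubs standing in for real API calls.
    """
    handlers = {
        "create": lambda title, branch: {"pr": 1, "title": title, "branch": branch},
        "review": lambda pr: {"pr": pr, "state": "approved"},
        "merge": lambda pr: {"pr": pr, "state": "merged"},
    }
    if action not in handlers:
        # Structured error so the model can self-correct on the next turn
        return {"error": "invalid_action", "hint": "Use create, review, or merge"}
    return handlers[action](**kwargs)
```

Three schemas collapse into one, cutting schema tokens while keeping each operation's arguments narrow.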

Tradeoffs

| Design choice | Option A | Option B | Decision criteria |
|---|---|---|---|
| Tool granularity | Few broad tools with many parameters | Many narrow tools with focused purpose | Narrow tools are more reliable per call (fewer hallucinated args) but increase selection confusion as count grows. Split when use cases need genuinely different descriptions; keep together when they share context. |
| Input handling | Strict validation — reject malformed input | Flexible normalization — accept variations, convert internally | Normalization reduces loop failures and retries at the cost of implementation complexity. Prefer normalization for agent-facing tools; strict validation for human-facing APIs. |
| Caching strategy | Aggressive — cache all tool results with TTL | Conservative — execute every call fresh | Aggressive caching cuts latency and cost but risks stale data. Cache read-only tools with short TTLs; never cache state-mutating tools. |
| Return verbosity | Full result payload | Minimal fields needed for next step | Minimal returns save context tokens and reduce attention dilution. Full returns are only justified when the model needs to branch on fields that are hard to predict upfront. |

Questions

References


What's next