Tool Use — How the Model Calls Your Code (And What It Never Sees)

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

One of the most important things to internalize about LLM tool use is what the model actually does (and doesn't do) when it "calls" a function.

Anthropic never executes your code. The model reads the description and schema you provide, decides whether a tool fits the request, generates arguments shaped by that schema, and stops. You run the function. You send the result back. The model then continues generating based on that result.

Understanding this distinction changes how you think about building reliable tool-based systems.

The Conversation Flow

Tool use turns a single API call into a multi-turn exchange:

  1. You send a message + your tool definitions

  2. The model responds with a tool_use block containing an id, a name, and arguments

  3. You run the actual function with those arguments

  4. You send the result back as a tool_result block in a new user message, matched to the call by its id

  5. The model generates its final response using that result, or asks for another tool call
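The five steps above can be sketched with plain data and no SDK. The block shapes mirror the ones used later in this post. The `getWeather` stub, the `toolHandlers` registry, the `toolu_example` id, and the sample assistant turn are all hypothetical stand-ins written for illustration, not output captured from a real call.

```typescript
// Content blocks as they appear in requests and responses.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
};
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type Message = {
  role: "user" | "assistant";
  content: string | Array<TextBlock | ToolUseBlock | ToolResultBlock>;
};

// Hypothetical local implementation. The model never sees this body.
function getWeather(input: Record<string, unknown>): string {
  return `72°F, partly cloudy in ${String(input.city)}`;
}

// Hypothetical registry mapping tool names to your own functions.
const toolHandlers: Record<string, (input: Record<string, unknown>) => string> = {
  get_weather: getWeather,
};

// Steps 3 and 4: run every requested tool and package the outputs as
// tool_result blocks inside a single user message.
function buildToolResultMessage(
  assistantContent: Array<TextBlock | ToolUseBlock>,
): Message {
  const results: ToolResultBlock[] = [];
  for (const block of assistantContent) {
    if (block.type !== "tool_use") continue; // plain text needs no action
    const handler = toolHandlers[block.name];
    if (!handler) throw new Error(`No handler registered for tool "${block.name}"`);
    results.push({
      type: "tool_result",
      tool_use_id: block.id, // ties the result to the call that produced it
      content: handler(block.input),
    });
  }
  return { role: "user", content: results };
}

// Step 2, hand-written: an assistant turn that contains a tool call.
const sampleAssistantContent: Array<TextBlock | ToolUseBlock> = [
  { type: "text", text: "Let me check the weather." },
  { type: "tool_use", id: "toolu_example", name: "get_weather", input: { city: "Chicago" } },
];

const toolResultMessage = buildToolResultMessage(sampleAssistantContent);
```

Keeping the handlers in a registry keyed by tool name means an unknown name fails loudly instead of silently doing nothing, which is one reasonable choice among several.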

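This is fundamentally different from a normal completion. The model's turn ends early, with a stop_reason of tool_use, because it cannot finish its answer without your result. The API is stateless, so "resuming" really means you send a new request containing the whole exchange so far, and the model picks up from there.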

What This Looks Like in Code

In a streaming context, adding tool use changes the stream in two ways. First, content_block_delta events no longer carry only text_delta payloads. You'll also see input_json_delta payloads, where the model streams the JSON arguments for a tool call in chunks. Second, content_block_start becomes load-bearing, because it tells you what kind of block is opening:

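// content_block_start for a tool call looks like:
{
  type: 'content_block_start',
  index: 1, // position of this block in the message's content array
  content_block: {
    type: 'tool_use',
    id: 'toolu_01A09q90qw90lq917835lq9',
    name: 'get_weather',
    input: {} // always empty here; the arguments arrive as input_json_delta chunks
  }
}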

When you see a tool_use block starting, you capture the name and id, then accumulate the incoming input_json_delta chunks into a string. You don't parse that string until content_block_stop fires — that's your signal that the block is complete and safe to process.

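Why wait for content_block_stop? Two reasons. The input_json_delta chunks are fragments, so the accumulated string is not valid JSON until the block closes. And a single message can contain multiple content blocks arriving in sequence, so a single buffer parsed at message end would mix the arguments of unrelated calls. Parsing at block stop means each tool call is handled as its own logical unit.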
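A minimal sketch of that accumulation, assuming the events have already been decoded into objects. The `StreamEvent` type covers only the subset of event shapes needed here, and `collectToolCalls`, the sample events, and the `toolu_example` id are hypothetical names made up for this example.

```typescript
// Subset of the streaming event shapes that matter for tool calls.
type StreamEvent =
  | {
      type: "content_block_start";
      index: number;
      content_block:
        | { type: "text"; text: string }
        | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
    }
  | {
      type: "content_block_delta";
      index: number;
      delta:
        | { type: "text_delta"; text: string }
        | { type: "input_json_delta"; partial_json: string };
    }
  | { type: "content_block_stop"; index: number };

type ToolCall = { id: string; name: string; input: unknown };

function collectToolCalls(events: Iterable<StreamEvent>): ToolCall[] {
  // One buffer per open tool_use block, keyed by the block's index.
  const pending = new Map<number, { id: string; name: string; json: string }>();
  const completed: ToolCall[] = [];

  for (const event of events) {
    if (event.type === "content_block_start") {
      const block = event.content_block;
      if (block.type === "tool_use") {
        pending.set(event.index, { id: block.id, name: block.name, json: "" });
      }
    } else if (event.type === "content_block_delta") {
      const delta = event.delta;
      const call = pending.get(event.index);
      // Only accumulate; the fragments are not parseable on their own.
      if (call && delta.type === "input_json_delta") call.json += delta.partial_json;
    } else {
      // content_block_stop: the buffer for this index is now complete.
      const call = pending.get(event.index);
      if (!call) continue; // a text block closed, nothing to parse
      pending.delete(event.index);
      // A tool with no arguments may stream nothing, so fall back to "{}".
      completed.push({ id: call.id, name: call.name, input: JSON.parse(call.json || "{}") });
    }
  }
  return completed;
}

// Hand-written sample stream: one text block, then one tool call split in two chunks.
const sampleEvents: StreamEvent[] = [
  { type: "content_block_start", index: 0, content_block: { type: "text", text: "" } },
  { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: "Checking." } },
  { type: "content_block_stop", index: 0 },
  {
    type: "content_block_start",
    index: 1,
    content_block: { type: "tool_use", id: "toolu_example", name: "get_weather", input: {} },
  },
  { type: "content_block_delta", index: 1, delta: { type: "input_json_delta", partial_json: '{"city": "Chi' } },
  { type: "content_block_delta", index: 1, delta: { type: "input_json_delta", partial_json: 'cago"}' } },
  { type: "content_block_stop", index: 1 },
];

const toolCalls = collectToolCalls(sampleEvents);
```

Keying the buffers by `index` is what keeps two tool calls in the same message from bleeding into each other.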

The Conversation State Problem

With basic streaming, the messages array was static — one request, one response. With tool use, it grows:

// After the model calls a tool:
messages = [
  { role: "user", content: "What's the weather in Chicago?" },
  { role: "assistant", content: [{ type: "tool_use", id: "...", name: "get_weather", input: { city: "Chicago" } }] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "...", content: "72°F, partly cloudy" }] }
]

If you don't append the tool calls and results to the conversation history before the next request, the model has no memory of what it called or what came back. The full conversation state must be sent with every turn.

This is also why the routing layer becomes a loop, not a single pass. Each iteration sends the current conversation state, streams the response, and either breaks (model is done) or continues (model called a tool and needs the result before it can finish).
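One way to sketch that loop, with the API call and the tool dispatch passed in as functions so the control flow is visible on its own. `callModel`, `runTool`, `runToolLoop`, and the `maxTurns` guard are hypothetical names and choices for this example, and the `stop_reason` type lists only the two values the loop cares about. Streaming is left out to keep the sketch short.

```typescript
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };
type Message = { role: "user" | "assistant"; content: string | Block[] };
// Subset of stop reasons; a real response can report others as well.
type ModelResponse = { stop_reason: "end_turn" | "tool_use"; content: Block[] };

// Hypothetical seams: callModel wraps your API request, runTool dispatches to your code.
type CallModel = (messages: Message[]) => Promise<ModelResponse>;
type RunTool = (name: string, input: Record<string, unknown>) => Promise<string>;

async function runToolLoop(
  messages: Message[],
  callModel: CallModel,
  runTool: RunTool,
  maxTurns = 5, // arbitrary guard against a model that never stops calling tools
): Promise<Message[]> {
  const history = [...messages];
  for (let turn = 0; turn < maxTurns; turn++) {
    // Every request carries the full conversation so far.
    const response = await callModel(history);
    history.push({ role: "assistant", content: response.content });
    if (response.stop_reason !== "tool_use") return history; // model is done

    const results: Block[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      try {
        const output = await runTool(block.name, block.input);
        results.push({ type: "tool_result", tool_use_id: block.id, content: output });
      } catch (err) {
        // Report the failure to the model instead of crashing the loop.
        results.push({ type: "tool_result", tool_use_id: block.id, content: String(err), is_error: true });
      }
    }
    history.push({ role: "user", content: results });
  }
  throw new Error(`Tool loop did not finish within ${maxTurns} turns`);
}

// Scripted stand-in for the API: asks for a tool first, then answers.
const fakeModel: CallModel = async (history) =>
  history.length === 1
    ? {
        stop_reason: "tool_use",
        content: [{ type: "tool_use", id: "toolu_example", name: "get_weather", input: { city: "Chicago" } }],
      }
    : { stop_reason: "end_turn", content: [{ type: "text", text: "It is 72°F and partly cloudy in Chicago." }] };
const fakeRunTool: RunTool = async () => "72°F, partly cloudy";

const finalHistory = runToolLoop(
  [{ role: "user", content: "What's the weather in Chicago?" }],
  fakeModel,
  fakeRunTool,
);
```

Returning tool failures as results flagged with `is_error` gives the model a chance to retry or explain the problem, while the turn limit keeps a confused model from looping indefinitely.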

Writing Tool Descriptions That Actually Work

Since the model selects tools based on their descriptions — not their implementations — description quality is where reliability lives. A useful mental model:

| Code concept | Tool concept |
| --- | --- |
| Function signature | Tool name + schema |
| JSDoc / comments | Tool description + property descriptions |
| Function body | Your implementation (model never sees this) |
| Return value | The tool result you send back |

The key difference from a function: the output is not deterministic. Given the same inputs, there's no guarantee the model selects the same tool or generates the same arguments every time. This is why description quality matters more than implementation quality for reliability.

When you have multiple tools and the model needs to pick between them, descriptions need to clearly differentiate. A good test: if a new engineer read only the description, would they know when to use this tool vs. the others?

// Too vague: the model will guess when to use this
// (input_schema omitted from both examples for brevity; a real definition needs one)
{
  name: "search",
  description: "Searches for information"
}

// Specific enough to be reliable
{
  name: "search_products",
  description: "Searches the product catalog by name, category, or SKU. Use this when the user is looking for a specific product or browsing a category. Do not use for order status, shipping, or account questions."
}
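For completeness, here is a sketch of the second definition with an `input_schema` attached, since property descriptions are part of what the model reads. The `query` and `max_results` properties are assumptions made up for this example, and the `ToolDefinition` type and `missingRequired` helper are hypothetical, not part of any SDK.

```typescript
// Shape of a tool definition: name, description, and a JSON Schema for the arguments.
type ToolDefinition = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required?: string[];
  };
};

const searchProducts: ToolDefinition = {
  name: "search_products",
  description:
    "Searches the product catalog by name, category, or SKU. Use this when the user is " +
    "looking for a specific product or browsing a category. Do not use for order status, " +
    "shipping, or account questions.",
  input_schema: {
    type: "object",
    properties: {
      // Hypothetical parameters; describe each one as carefully as the tool itself.
      query: {
        type: "string",
        description: "Product name, category, or SKU to search for, e.g. \"wireless headphones\".",
      },
      max_results: {
        type: "integer",
        description: "Maximum number of products to return.",
      },
    },
    required: ["query"],
  },
};

// The arguments are generated text, so check them before calling your function.
// This only checks for required keys; it is not a full JSON Schema validator.
function missingRequired(tool: ToolDefinition, input: Record<string, unknown>): string[] {
  return (tool.input_schema.required ?? []).filter((key) => !(key in input));
}

const missing = missingRequired(searchProducts, { max_results: 5 });
```

A check like this belongs on your side of the loop: if `missing` is non-empty, you can send the problem back to the model as the tool result rather than calling the function with incomplete arguments.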

Why This Matters Beyond Demos

The tool use pattern shows up in almost every production LLM system worth building: agents that can query databases, orchestrators that delegate to specialist models, copilots that can take actions in your application. The underlying mechanism is always the same — the model generates intent, your code executes, the result flows back.

The engineering challenge isn't the API call. It's building the conversation state management, the result routing, and the error handling that makes this loop reliable at scale. Those are the parts that look easy in tutorials and break in production.


David Hahn | Applied AI Engineering
