Tool Use — How the Model Calls Your Code (And What It Never Sees)

One of the most important things to internalize about LLM tool use is what the model actually does (and doesn't do) when it "calls" a function.

Anthropic never executes your code. The model reads the description and schema you provide, decides whether a tool fits the request, generates arguments for it, and stops. You run the function. You send the result back. The model then continues generating based on that result. The arguments are model output like any other: they usually match your schema, but validate them before acting on them.

Understanding this distinction changes how you think about building reliable tool-based systems.

The Conversation Flow

Tool use turns a single API call into a multi-turn exchange:

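1. You send a message plus your tool definitions.
2. The model responds with a tool_use content block containing an id, a tool name, and arguments, and the response ends with stop_reason "tool_use".
3. You run the actual function with those arguments.
4. You send the result back as a tool_result content block in a new user message, referencing that id.
5. The model generates its final response using that result.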
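
The middle of that exchange (receive a tool_use block, run the function, build the tool_result) can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: toolHandlers and runToolCalls are hypothetical names, the get_weather handler returns canned data, and handlers are synchronous here, whereas real ones would usually be async.

```javascript
// Hypothetical registry mapping tool names to local functions.
// get_weather is a stand-in; a real handler would call a weather service.
const toolHandlers = new Map([
  ["get_weather", ({ city }) => `72°F, partly cloudy in ${city}`],
]);

// Run every tool_use block found in an assistant message and return the
// tool_result blocks to send back in the next user message.
function runToolCalls(assistantContent, handlers) {
  return assistantContent
    .filter((block) => block.type === "tool_use")
    .map((block) => {
      const handler = handlers.get(block.name);
      if (!handler) {
        // Report the failure to the model instead of throwing.
        return {
          type: "tool_result",
          tool_use_id: block.id,
          content: `Unknown tool: ${block.name}`,
          is_error: true,
        };
      }
      try {
        return {
          type: "tool_result",
          tool_use_id: block.id, // must echo the id from the tool_use block
          content: String(handler(block.input)),
        };
      } catch (err) {
        return {
          type: "tool_result",
          tool_use_id: block.id,
          content: String(err),
          is_error: true,
        };
      }
    });
}
```

Returning an error result rather than throwing lets the model see what went wrong and decide how to proceed, which is one reasonable design choice among several.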

This is fundamentally different from a normal completion. The model's turn ends at the point where it decides to call a tool: it cannot continue until you act on the call, and it picks up again only when you send a new request that includes the result.

What This Looks Like in Code

In a streaming context, adding tool use changes the stream in two ways. First, content_block_delta events no longer carry only text_delta payloads; you'll also see input_json_delta payloads, which are the JSON arguments for a tool call streamed in chunks. Second, content_block_start becomes load-bearing, because it tells you what kind of block is opening:

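// content_block_start for a tool call looks like:
{
  type: 'content_block_start',
  index: 1, // position of this block within the message's content array
  content_block: {
    type: 'tool_use',
    id: 'toolu_01A09q90qw90lq917835lq9',
    name: 'get_weather',
    input: {} // empty at this point; arguments arrive as input_json_delta chunks
  }
}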

When you see a tool_use block starting, you capture the name and id, then accumulate the partial_json string from each incoming input_json_delta into a buffer. You don't parse that buffer until content_block_stop fires, because that's your signal that the block is complete and safe to process.

Why wait for content_block_stop? Because the chunks are fragments, not valid JSON on their own, and because a single message can contain multiple content blocks arriving in sequence. Accumulating everything into one buffer and parsing at message end would concatenate the arguments of separate tool calls. Parsing at block stop means each tool call is handled as its own logical unit.
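
A minimal sketch of that accumulate-then-parse approach is below. createToolCallCollector is a hypothetical helper, not part of any SDK; it assumes each event carries the index of the block it belongs to, and it treats an empty buffer as an empty argument object on the assumption that a call with no arguments may stream no JSON at all.

```javascript
// Collects completed tool calls from a sequence of stream events.
function createToolCallCollector() {
  const open = new Map(); // block index -> { id, name, json }
  const completed = [];

  function handle(event) {
    if (
      event.type === "content_block_start" &&
      event.content_block.type === "tool_use"
    ) {
      // A tool call is opening: remember its id and name, start a buffer.
      const { id, name } = event.content_block;
      open.set(event.index, { id, name, json: "" });
    } else if (
      event.type === "content_block_delta" &&
      event.delta.type === "input_json_delta"
    ) {
      // Append the fragment to the buffer of the block it belongs to.
      const block = open.get(event.index);
      if (block) block.json += event.delta.partial_json;
    } else if (event.type === "content_block_stop" && open.has(event.index)) {
      // The block is complete, so the buffer is now safe to parse.
      const { id, name, json } = open.get(event.index);
      open.delete(event.index);
      // Assumption: an empty buffer means the tool was called with no arguments.
      completed.push({ id, name, input: json === "" ? {} : JSON.parse(json) });
    }
  }

  return { handle, completed };
}
```

Keying the buffers by block index keeps text blocks and multiple tool calls in the same message from interfering with each other.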

The Conversation State Problem

With basic streaming, the messages array was static — one request, one response. With tool use, it grows:

// After the model calls a tool:
const messages = [
  { role: "user", content: "What's the weather in Chicago?" },
  // The assistant turn is appended as the model produced it
  { role: "assistant", content: [{ type: "tool_use", id: "...", name: "get_weather", input: { city: "Chicago" } }] },
  // tool_use_id must match the id of the tool_use block it answers
  { role: "user", content: [{ type: "tool_result", tool_use_id: "...", content: "72°F, partly cloudy" }] }
];

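If you don't append the tool calls and results to the conversation history before the next request, the model has no record of what it called or what came back. The API is stateless, so the full conversation state must be sent with every turn, and each tool_result must directly follow the assistant message containing the matching tool_use block.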

This is also why the routing layer becomes a loop, not a single pass. Each iteration sends the current conversation state, streams the response, and either breaks (the model is done) or continues (the response ended with stop_reason "tool_use", so the model needs the result before it can finish).
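
One way to sketch that loop, under stated assumptions: callModel and runTool are hypothetical functions injected by the caller (callModel stands in for whatever sends the history to the API and returns an object with content and stop_reason), streaming is left out to keep the control flow visible, and the maxTurns cap is an added safeguard rather than something the API requires.

```javascript
// Repeatedly call the model until it stops asking for tools.
// callModel(history) -> Promise<{ content: Block[], stop_reason: string }>
// runTool(name, input) -> Promise<string>
async function runToolLoop(messages, callModel, runTool, maxTurns = 5) {
  const history = [...messages]; // copy so the caller's array is not mutated

  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await callModel(history);

    // Always record the assistant turn, tool calls included.
    history.push({ role: "assistant", content: response.content });

    // Any stop reason other than "tool_use" means there is nothing to run.
    if (response.stop_reason !== "tool_use") return history;

    // Run each requested tool and collect the results in one user message.
    const results = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      results.push({
        type: "tool_result",
        tool_use_id: block.id,
        content: await runTool(block.name, block.input),
      });
    }
    history.push({ role: "user", content: results });
  }

  throw new Error(`No final answer after ${maxTurns} turns`);
}
```

Injecting callModel and runTool keeps the loop testable with a fake model, which is often the easiest way to exercise the state handling before wiring in real network calls.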

Writing Tool Descriptions That Actually Work

Since the model selects tools based on their descriptions — not their implementations — description quality is where reliability lives. A useful mental model:

Code concept	Tool concept
Function signature	Tool name + schema
JSDoc / comments	Tool description + property descriptions
Function body	Your implementation (model never sees this)
Return value	The tool result you send back

The key difference from a function: the output is not deterministic. Given the same inputs, there's no guarantee the model selects the same tool or generates the same arguments every time. The description is the main lever you have over that variance, which is why description quality matters so much for reliability.

When you have multiple tools and the model needs to pick between them, descriptions need to clearly differentiate. A good test: if a new engineer read only the description, would they know when to use this tool vs. the others?

// Too vague: the model will guess when to use this
// (input_schema is omitted in both examples to keep the focus on the description)
{
  name: "search",
  description: "Searches for information"
}

// Specific enough to be reliable
{
  name: "search_products",
  description: "Searches the product catalog by name, category, or SKU. Use this when the user is looking for a specific product or browsing a category. Do not use for order status, shipping, or account questions."
}
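
For completeness, here is a sketch of what a full definition might look like once the schema and property descriptions are filled in, along with a deliberately small check of the arguments the model sends back. The query and limit properties and the checkToolInput helper are assumptions made for illustration; a production system would more likely use a full JSON Schema validator.

```javascript
// Illustrative definition: the query and limit properties are assumptions.
const searchProductsTool = {
  name: "search_products",
  description:
    "Searches the product catalog by name, category, or SKU. Use this when " +
    "the user is looking for a specific product or browsing a category. " +
    "Do not use for order status, shipping, or account questions.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Product name, category, or SKU to search for" },
      limit: { type: "integer", description: "Maximum number of results to return" },
    },
    required: ["query"],
  },
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Minimal check of model-generated arguments: required keys present,
// no unknown keys, and string/number/boolean/integer types match.
// Returns a list of problems; an empty list means the input passed.
function checkToolInput(tool, input) {
  const { properties, required = [] } = tool.input_schema;
  const problems = [];
  for (const key of required) {
    if (!has(input, key)) problems.push(`missing required property "${key}"`);
  }
  for (const [key, value] of Object.entries(input)) {
    if (!has(properties, key)) {
      problems.push(`unexpected property "${key}"`);
      continue;
    }
    const expected = properties[key].type;
    const ok =
      expected === "integer" ? Number.isInteger(value) : typeof value === expected;
    if (!ok) problems.push(`"${key}" should be of type ${expected}`);
  }
  return problems;
}
```

When the check fails, the list of problems can be sent back to the model as an error tool result so it has a chance to correct the call.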

Why This Matters Beyond Demos

The tool use pattern shows up in almost every production LLM system worth building: agents that can query databases, orchestrators that delegate to specialist models, copilots that can take actions in your application. The underlying mechanism is always the same — the model generates intent, your code executes, the result flows back.

The engineering challenge isn't the API call. It's building the conversation state management, the result routing, and the error handling that makes this loop reliable at scale. Those are the parts that look easy in tutorials and break in production.
