Tool Use + RAG — When Retrieval Becomes a Decision

In the basic RAG module, retrieval was unconditional. Every query triggered the same pipeline: embed the query, search the vector store, stuff the top chunks into the prompt, generate. The model had no say in whether retrieval was even appropriate.

That works fine for a single-purpose search tool. It breaks down the moment you have multiple data sources, multiple tools, and questions that don't all need the same retrieval path.

The fix is conceptually simple: make RAG a tool. Instead of retrieval being a mandatory pre-processing step, it becomes one capability among many that the model can choose to invoke — or skip — based on what the question actually needs.

What Changes

In 03-rag-basic, the flow was linear and predetermined:

query → embed → retrieve → prompt → generate

In 06-tool-use-rag, the flow is dynamic:

query → model decides → [search_handbook? get_employee_info? get_pto_balance? none?] → generate

Given the question "How much PTO does alice.bob have left?", the model might call get_employee_info and get_pto_balance in parallel. Given "What's our parental leave policy?", it calls only search_handbook. Given "What does RAG stand for?", it answers directly without calling anything.

That routing decision — which is made by the model, not hardcoded — is what makes the agent pattern fundamentally more capable than a fixed retrieval pipeline.
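As a rough sketch of what "RAG as a tool" looks like in code, the three tools named above could be declared in the Anthropic tool format (name, description, input_schema). The tool names come from this module. The descriptions and the parameter names (query, employee_id) are assumptions made for illustration, not the module's actual definitions.

```typescript
// Hypothetical tool definitions: names are from the text above,
// descriptions and parameter names are illustrative assumptions.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const tools: ToolDefinition[] = [
  {
    name: "search_handbook",
    description: "Search the employee handbook for policy text relevant to a query.",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string", description: "Natural-language search query" },
      },
      required: ["query"],
    },
  },
  {
    name: "get_employee_info",
    description: "Look up a specific employee's record (for example, tenure).",
    input_schema: {
      type: "object",
      properties: {
        employee_id: { type: "string", description: "Employee identifier, e.g. alice.bob" },
      },
      required: ["employee_id"],
    },
  },
  {
    name: "get_pto_balance",
    description: "Return the PTO balance for a specific employee.",
    input_schema: {
      type: "object",
      properties: {
        employee_id: { type: "string", description: "Employee identifier, e.g. alice.bob" },
      },
      required: ["employee_id"],
    },
  },
];
```

Retrieval is just the first entry in that list. Nothing in the request forces the model to call it, which is the whole point.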

Parallel vs. Sequential Tool Calls

This is one of the more important distinctions to understand for production agent work.

Parallel: the model requests multiple tools in a single turn because it already knows what it needs and the results don't depend on each other. Both tool calls arrive in the same response, your code can run them concurrently, and both results go back in one message for the model to synthesize into a single response. This is the common case for questions that span data sources.

Sequential: the model calls one tool, receives the result, then decides it needs another tool based on what came back. This requires multiple loop iterations. For a question like "find the most severe open bug and draft a response", the model can't call get_issue_details until the search results tell it which issue ID to look up.

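This is exactly why the routing layer needs to be a while loop, not an if/else. A single branch only handles one round of tool calls. The loop keeps running for as long as stop_reason === "tool_use". It exits once the model stops asking for tools, normally with stop_reason === "end_turn", meaning the model is done and has produced its final response.

while (true) {
  // One model turn: send the conversation plus tool definitions, collect the full response.
  const response = await streamAndCollect(messages, tools);
  messages.push({ role: "assistant", content: response.content });

  // Any stop reason other than "tool_use" ("end_turn", "max_tokens", ...) means
  // there are no tool calls to run, so stop instead of sending back an empty result.
  if (response.stop_reason !== "tool_use") break;

  // Execute every tool call in the response and return the results as a user message.
  const toolResults = await executeTools(response.content);
  messages.push({ role: "user", content: toolResults });
}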
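One possible shape for executeTools, sketched under assumptions rather than taken from the module: it picks the tool_use blocks out of the assistant's content, runs them concurrently with Promise.all (the parallel case), and returns one tool_result block per call. The tool_use and tool_result shapes follow the Anthropic Messages API. The handlers table and its mock return values are hypothetical.

```typescript
// Content block shapes from the Anthropic Messages API, simplified.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };

interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

type ToolHandler = (input: Record<string, unknown>) => Promise<unknown>;

// Hypothetical mock handlers; the returned values are placeholders.
const handlers: Record<string, ToolHandler> = {
  get_employee_info: async (input) => ({ employee_id: input.employee_id, tenure_years: 3 }),
  get_pto_balance: async (input) => ({ employee_id: input.employee_id, pto_balance: 0 }),
};

async function executeTools(content: ContentBlock[]): Promise<ToolResultBlock[]> {
  const calls = content.filter(
    (block): block is Extract<ContentBlock, { type: "tool_use" }> => block.type === "tool_use",
  );

  // Calls requested in the same turn are independent, so run them concurrently.
  return Promise.all(
    calls.map(async (call): Promise<ToolResultBlock> => {
      const handler = handlers[call.name];
      if (!handler) {
        // Report the failure to the model instead of throwing out of the loop.
        return {
          type: "tool_result",
          tool_use_id: call.id,
          content: `Unknown tool: ${call.name}`,
          is_error: true,
        };
      }
      try {
        const result = await handler(call.input);
        return { type: "tool_result", tool_use_id: call.id, content: JSON.stringify(result) };
      } catch (err) {
        return { type: "tool_result", tool_use_id: call.id, content: String(err), is_error: true };
      }
    }),
  );
}
```

Each result carries the tool_use_id of the call it answers, so the model can match results to requests regardless of the order in which the handlers finish. Sequential chains need nothing extra here: they simply take another pass through the loop.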

Why System Prompt Ordering Is Not Optional

Tool descriptions tell the model what each tool does. The system prompt tells it when and in what order to use them. Both are required for reliable behavior — and confusing the two is one of the most common failure modes in production agent systems.

Without explicit instructions, the model tends to answer policy questions from its training data instead of actually searching the handbook. The answers sound confident and can be completely wrong.

system: `Guidelines:
- Always search the handbook before answering policy questions — don't rely on your own knowledge
- If a question involves a specific employee, look them up first
- If a question needs both employee info and a policy, use both tools
- If you can't find relevant information, say so — don't make things up`

The tool description handles the "what." The system prompt handles the "when" and "in what order." Omitting the system prompt guidance means you're hoping the model infers the right sequencing from tool descriptions alone — which it sometimes will and often won't.
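To make the division of labor concrete, here is a hedged sketch of how the two pieces might sit side by side in a single request. The field names (model, max_tokens, system, tools, messages) match the Anthropic Messages API. The model ID is a placeholder, the tool description wording is an assumption, and buildRequest is a hypothetical helper.

```typescript
interface MessageParam {
  role: "user" | "assistant";
  content: string;
}

// The "what": the tool description says what the tool does (wording is illustrative).
const searchHandbookTool = {
  name: "search_handbook",
  description: "Search the employee handbook and return the most relevant policy passages.",
  input_schema: {
    type: "object" as const,
    properties: { query: { type: "string" } },
    required: ["query"],
  },
};

// The "when" and "in what order": the guidelines from the system prompt above.
const SYSTEM_PROMPT = `Guidelines:
- Always search the handbook before answering policy questions; don't rely on your own knowledge
- If a question involves a specific employee, look them up first
- If a question needs both employee info and a policy, use both tools
- If you can't find relevant information, say so; don't make things up`;

// Hypothetical helper that assembles the request parameters.
function buildRequest(messages: MessageParam[]) {
  return {
    model: "MODEL_ID_PLACEHOLDER", // assumption: substitute a real model ID
    max_tokens: 1024,
    system: SYSTEM_PROMPT,
    tools: [searchHandbookTool],
    messages,
  };
}
```

Keeping the sequencing rules in system rather than squeezing them into each description means one place to edit when the routing policy changes.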

The Cross-Referencing Behavior

The most interesting result from testing this module: when asked about PTO for an employee with three years of tenure, the model:

1. Called get_employee_info to confirm tenure (3 years)
2. Called search_handbook to retrieve the PTO policy
3. Matched the tenure against the correct policy bracket (2–5 years = 20 days)
4. Added the remaining balance, without being asked, by combining pto_balance and pto_used from the employee data

No single tool returned that answer. The model synthesized two results into a response richer than what either source contained alone. That synthesis across data sources — rather than lookup from one — is the core value of the agent pattern over a fixed retrieval pipeline.

What This Looks Like in Production

The setup in this module is a simplified version of what real enterprise deployments look like in forward-deployed engineering (FDE) work:

| Module tool | Production equivalent |
| --- | --- |
| `search_handbook` | Vector search over internal docs, Confluence, Notion |
| `get_employee_info` | HR system API (Workday, BambooHR) |
| `get_pto_balance` | Payroll system API |

The model doesn't care that these are mocked. The architecture is identical. Swapping a mock for a real API is a one-line change in the tool execution layer.
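A minimal sketch of why the swap can be that small, assuming the tool execution layer is a lookup table from tool name to implementation. Everything named here is hypothetical: the table, the mock, the injected fetchJson function, and the hr.example.com URL. Real auth, rate limiting, and permission checks are deliberately left out.

```typescript
type ToolImpl = (input: { employee_id: string }) => Promise<Record<string, unknown>>;
type FetchJson = (url: string) => Promise<Record<string, unknown>>;

// Mock implementation, as a demo module might use (the value is a placeholder).
const mockGetEmployeeInfo: ToolImpl = async ({ employee_id }) => ({
  employee_id,
  tenure_years: 3,
});

// Hypothetical real implementation. The HTTP client is injected so the
// URL and transport are easy to replace; both are assumptions.
function makeRealGetEmployeeInfo(fetchJson: FetchJson): ToolImpl {
  return ({ employee_id }) =>
    fetchJson(`https://hr.example.com/api/employees/${encodeURIComponent(employee_id)}`);
}

// The tool execution layer only sees this table, so changing the data
// source for a tool means changing one entry.
const toolImpls: Record<string, ToolImpl> = {
  get_employee_info: mockGetEmployeeInfo, // production: makeRealGetEmployeeInfo(client)
};

// Dispatch by tool name; the model-facing tool definition never changes.
function runTool(name: string, input: { employee_id: string }) {
  const impl = toolImpls[name];
  if (!impl) return Promise.reject(new Error(`Unknown tool: ${name}`));
  return impl(input);
}
```

The one-line swap is the easy part. Whatever sits behind fetchJson is where the integration work described below ends up.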

The hard problem in this kind of work isn't the model call. It's the integrations — real auth, real rate limits, real data permissions, real failure modes. That's where most of the engineering work actually lives in production agent systems, and it's the part that's invisible in most demos.
