Tool Use + RAG — When Retrieval Becomes a Decision
In the basic RAG module, retrieval was unconditional. Every query triggered the same pipeline: embed the query, search the vector store, stuff the top chunks into the prompt, generate. The model had no say in whether retrieval was even appropriate.
That works fine for a single-purpose search tool. It breaks down the moment you have multiple data sources, multiple tools, and questions that don't all need the same retrieval path.
The fix is conceptually simple: make RAG a tool. Instead of retrieval being a mandatory pre-processing step, it becomes one capability among many that the model can choose to invoke — or skip — based on what the question actually needs.
What Changes
In 03-rag-basic, the flow was linear and predetermined:
query → embed → retrieve → prompt → generate
In 06-tool-use-rag, the flow is dynamic:
query → model decides → [search_handbook? get_employee_info? get_pto_balance? none?] → generate
Given the question "How much PTO does alice.bob have left?", the model might call get_employee_info and get_pto_balance in parallel. Given "What's our parental leave policy?", it calls only search_handbook. Given "What does RAG stand for?", it answers directly without calling anything.
That routing decision — which is made by the model, not hardcoded — is what makes the agent pattern fundamentally more capable than a fixed retrieval pipeline.
Parallel vs. Sequential Tool Calls
This is one of the more important distinctions to understand for production agent work.
Parallel: the model calls multiple tools in a single turn because it already knows what it needs and the results don't depend on each other. Both calls go out simultaneously, both results come back, the model synthesizes them into a single response. This is the common case for questions that span data sources.
Sequential: the model calls one tool, receives the result, then decides it needs another tool based on what came back. This requires multiple loop iterations. A question like "find the most severe open bug and draft a response" can't call get_issue_details until it knows which issue ID to look up from the search results.
This is exactly why the routing layer needs to be a while loop, not an if/else. A single branch only handles one round of tool calls. The loop keeps running until stop_reason === "end_turn" — meaning the model is done and ready to generate a final response.
while (true) {
const response = await streamAndCollect(messages, tools);
messages.push({ role: "assistant", content: response.content });
if (response.stop_reason === "end_turn") break;
// execute tool calls, append results, loop
const toolResults = await executeTools(response.content);
messages.push({ role: "user", content: toolResults });
}
Why System Prompt Ordering Is Not Optional
Tool descriptions tell the model what each tool does. The system prompt tells it when and in what order to use them. Both are required for reliable behavior — and confusing the two is one of the most common failure modes in production agent systems.
Without explicit instructions, the model tends to answer policy questions from its training data instead of actually searching the handbook. The answers sound confident and can be completely wrong.
system: `Guidelines:
- Always search the handbook before answering policy questions — don't rely on your own knowledge
- If a question involves a specific employee, look them up first
- If a question needs both employee info and a policy, use both tools
- If you can't find relevant information, say so — don't make things up`
The tool description handles the "what." The system prompt handles the "when" and "in what order." Omitting the system prompt guidance means you're hoping the model infers the right sequencing from tool descriptions alone — which it sometimes will and often won't.
The Cross-Referencing Behavior
The most interesting result from testing this module: when asked about PTO for an employee with three years of tenure, the model:
Called
get_employee_infoto confirm tenure (3 years)Called
search_handbookto retrieve the PTO policyMatched the tenure against the correct policy bracket (2–5 years = 20 days)
Added the remaining balance by combining
pto_balanceandpto_usedfrom the employee data — without being asked
No single tool returned that answer. The model synthesized two results into a response richer than what either source contained alone. That synthesis across data sources — rather than lookup from one — is the core value of the agent pattern over a fixed retrieval pipeline.
What This Looks Like in Production
The setup in this module is a simplified version of what real enterprise FDE deployments actually look like:
| Module tool | Production equivalent |
|---|---|
search_handbook |
Vector search over internal docs, Confluence, Notion |
get_employee_info |
HR system API (Workday, BambooHR) |
get_pto_balance |
Payroll system API |
The model doesn't care that these are mocked. The architecture is identical. Swapping a mock for a real API is one line of change in the tool execution layer.
The hard problem in this kind of work isn't the model call. It's the integrations — real auth, real rate limits, real data permissions, real failure modes. That's where most of the engineering work actually lives in production agent systems, and it's the part that's invisible in most demos.

