Tool Use + RAG — When Retrieval Becomes a Decision

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

In the basic RAG module, retrieval was unconditional. Every query triggered the same pipeline: embed the query, search the vector store, stuff the top chunks into the prompt, generate. The model had no say in whether retrieval was even appropriate.

That works fine for a single-purpose search tool. It breaks down the moment you have multiple data sources, multiple tools, and questions that don't all need the same retrieval path.

The fix is conceptually simple: make RAG a tool. Instead of retrieval being a mandatory pre-processing step, it becomes one capability among many that the model can choose to invoke — or skip — based on what the question actually needs.

What Changes

In 03-rag-basic, the flow was linear and predetermined:

query → embed → retrieve → prompt → generate

In 06-tool-use-rag, the flow is dynamic:

query → model decides → [search_handbook? get_employee_info? get_pto_balance? none?] → generate

Given the question "How much PTO does alice.bob have left?", the model might call get_employee_info and get_pto_balance in parallel. Given "What's our parental leave policy?", it calls only search_handbook. Given "What does RAG stand for?", it answers directly without calling anything.

That routing decision — which is made by the model, not hardcoded — is what makes the agent pattern fundamentally more capable than a fixed retrieval pipeline.
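
To make the routing concrete, here is a minimal sketch of how the three tools might be declared. The field names (name, description, input_schema) follow the Anthropic Messages API tool format; the descriptions and the parameter names query and username are assumptions for illustration, not the module's actual definitions.

```typescript
// Minimal tool-definition shape; input_schema is a JSON Schema object.
type ToolDefinition = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
};

// Hypothetical definitions: wording and parameter names are illustrative.
const tools: ToolDefinition[] = [
  {
    name: "search_handbook",
    description: "Search the company handbook and return the most relevant policy passages.",
    input_schema: {
      type: "object",
      properties: { query: { type: "string", description: "Natural-language search query" } },
      required: ["query"],
    },
  },
  {
    name: "get_employee_info",
    description: "Look up an employee's profile, including tenure.",
    input_schema: {
      type: "object",
      properties: { username: { type: "string", description: "Employee username, e.g. alice.bob" } },
      required: ["username"],
    },
  },
  {
    name: "get_pto_balance",
    description: "Return an employee's PTO balance and PTO used.",
    input_schema: {
      type: "object",
      properties: { username: { type: "string", description: "Employee username, e.g. alice.bob" } },
      required: ["username"],
    },
  },
];
```

There is deliberately no "none" tool in the list: answering directly is simply what happens when the model requests no tool at all.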

Parallel vs. Sequential Tool Calls

This is one of the more important distinctions to understand for production agent work.

Parallel: the model calls multiple tools in a single turn because it already knows what it needs and the results don't depend on each other. Both calls go out together, both results come back, and the model synthesizes them into a single response. This is the common case for questions that span data sources.

Sequential: the model calls one tool, receives the result, then decides it needs another tool based on what came back. This requires multiple loop iterations. For a question like "find the most severe open bug and draft a response", the model can't call get_issue_details until the search results tell it which issue ID to look up.

This is exactly why the routing layer needs to be a while loop, not an if/else. A single branch handles only one round of tool calls. The loop keeps running until the model stops requesting tools, which in the normal case means stop_reason === "end_turn": the model is done and has produced its final response.

while (true) {
  const response = await streamAndCollect(messages, tools);
  messages.push({ role: "assistant", content: response.content });

  // Exit unless the model asked for tools. "end_turn" is the normal exit, but
  // other stop reasons (for example "max_tokens") also leave nothing to execute.
  if (response.stop_reason !== "tool_use") break;

  // Run the requested tool calls, append their results as a user turn, and loop.
  const toolResults = await executeTools(response.content);
  messages.push({ role: "user", content: toolResults });
}
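
The executeTools helper is not shown in this post. A minimal sketch, assuming Anthropic-style tool_use and tool_result content blocks and hypothetical mock handlers (the returned values are made up), might run every requested call concurrently with Promise.all. That is what lets the parallel case finish in a single loop iteration.

```typescript
// Content block shapes modeled on the Anthropic Messages API; treat them as assumptions.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };
type ContentBlock = TextBlock | ToolUseBlock;
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };
type ToolHandler = (input: Record<string, unknown>) => Promise<unknown>;

// Hypothetical mock handlers; real ones would call a vector store or an HR API.
const handlers: Record<string, ToolHandler> = {
  search_handbook: async (input) => ({ passages: [`(mock) handbook match for: ${String(input["query"])}`] }),
  get_employee_info: async (input) => ({ username: input["username"], tenure_years: 3 }),
  get_pto_balance: async (input) => ({ username: input["username"], pto_balance: 20, pto_used: 5 }),
};

async function executeTools(content: ContentBlock[]): Promise<ToolResultBlock[]> {
  // Keep only the tool requests; text blocks need no execution.
  const calls = content.filter((block): block is ToolUseBlock => block.type === "tool_use");

  // Independent calls run concurrently; results come back in request order.
  return Promise.all(
    calls.map(async (call): Promise<ToolResultBlock> => {
      const handler = handlers[call.name];
      if (!handler) {
        return { type: "tool_result", tool_use_id: call.id, content: `Unknown tool: ${call.name}`, is_error: true };
      }
      try {
        const result = await handler(call.input);
        return { type: "tool_result", tool_use_id: call.id, content: JSON.stringify(result) };
      } catch (err) {
        // Report the failure to the model instead of crashing the loop.
        const message = err instanceof Error ? err.message : String(err);
        return { type: "tool_result", tool_use_id: call.id, content: message, is_error: true };
      }
    }),
  );
}
```

Returning errors as tool results, rather than throwing, gives the model a chance to retry with different input or tell the user it could not find the information.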

Why System Prompt Ordering Is Not Optional

Tool descriptions tell the model what each tool does. The system prompt tells it when and in what order to use them. Both are required for reliable behavior — and confusing the two is one of the most common failure modes in production agent systems.

Without explicit instructions, the model tends to answer policy questions from its training data instead of actually searching the handbook. The answers sound confident and can be completely wrong.

system: `Guidelines:
- Always search the handbook before answering policy questions — don't rely on your own knowledge
- If a question involves a specific employee, look them up first
- If a question needs both employee info and a policy, use both tools
- If you can't find relevant information, say so — don't make things up`

The tool description handles the "what." The system prompt handles the "when" and "in what order." Omitting the system prompt guidance means you're hoping the model infers the right sequencing from tool descriptions alone — which it sometimes will and often won't.
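
One way to keep the two roles separate in code is to assemble them in different places. The sketch below is an assumption about structure, not the module's actual code: buildRequest, Message, and the tool description text are hypothetical, and the guidelines are shortened from the prompt above.

```typescript
type ToolDefinition = {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, unknown>; required: string[] };
};
type Message = { role: "user" | "assistant"; content: string };

// The "what": describes the capability and says nothing about ordering.
const searchHandbook: ToolDefinition = {
  name: "search_handbook",
  description: "Search the company handbook and return the most relevant policy passages.",
  input_schema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
};

// The "when" and "in what order": lives in the system prompt.
const SYSTEM_PROMPT = [
  "Guidelines:",
  "- Always search the handbook before answering policy questions.",
  "- If a question involves a specific employee, look them up first.",
].join("\n");

// Every request carries both pieces, so neither is left to inference.
function buildRequest(messages: Message[]) {
  return { system: SYSTEM_PROMPT, tools: [searchHandbook], messages };
}
```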

The Cross-Referencing Behavior

The most interesting result from testing this module: when asked about PTO for an employee with three years of tenure, the model:

  1. Called get_employee_info to confirm tenure (3 years)

  2. Called search_handbook to retrieve the PTO policy

  3. Matched the tenure against the correct policy bracket (2–5 years = 20 days)

  4. Added the remaining balance by combining pto_balance and pto_used from the employee data — without being asked

No single tool returned that answer. The model synthesized two results into a response richer than what either source contained alone. That synthesis across data sources — rather than lookup from one — is the core value of the agent pattern over a fixed retrieval pipeline.
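
The model did this reasoning in natural language. Written out as code, the synthesis amounts to roughly the following sketch. Only the 3-year tenure and the 2 to 5 year bracket come from the test above; the employee record is made up, and treating remaining PTO as pto_balance minus pto_used is an assumption about what those fields mean.

```typescript
type Employee = { username: string; tenure_years: number; pto_balance: number; pto_used: number };
type PtoBracket = { minYears: number; maxYears: number; daysPerYear: number };

// Only the bracket stated in the post; other brackets are omitted rather than guessed.
const brackets: PtoBracket[] = [{ minYears: 2, maxYears: 5, daysPerYear: 20 }];

// Steps 1-3: match tenure (from get_employee_info) against the policy (from search_handbook).
function entitlementFor(tenureYears: number): number | undefined {
  return brackets.find((b) => tenureYears >= b.minYears && tenureYears <= b.maxYears)?.daysPerYear;
}

// Step 4, assuming pto_balance is the total allotment for the year.
function remainingPto(employee: Employee): number {
  return employee.pto_balance - employee.pto_used;
}

// Hypothetical record for illustration.
const employee: Employee = { username: "example.user", tenure_years: 3, pto_balance: 20, pto_used: 5 };
console.log(entitlementFor(employee.tenure_years), remainingPto(employee)); // 20 15
```

In a fixed pipeline, each of these steps would have to be anticipated and hardcoded. Here the model worked out the combination on its own from two tool results.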

What This Looks Like in Production

The setup in this module is a simplified version of what real enterprise deployments look like in forward-deployed engineering (FDE) work:

Module tool | Production equivalent
search_handbook | Vector search over internal docs, Confluence, Notion
get_employee_info | HR system API (Workday, BambooHR)
get_pto_balance | Payroll system API

The model doesn't care that these are mocked. The architecture is identical. Swapping a mock for a real API can be as small as a one-line change in the tool execution layer; the loop, the tool definitions, and the system prompt stay the same.

The hard problem in this kind of work isn't the model call. It's the integrations — real auth, real rate limits, real data permissions, real failure modes. That's where most of the engineering work actually lives in production agent systems, and it's the part that's invisible in most demos.
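
As one small illustration of that integration work, a tool handler that calls a rate-limited API usually needs retries. The following is a minimal sketch with hypothetical names (withRetry, RetryOptions) that are not part of the module; a production version would also need timeouts and a way to tell retryable errors from permanent ones.

```typescript
type RetryOptions = { attempts: number; baseDelayMs: number };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry an async call with exponential backoff, rethrowing the last error if all attempts fail.
async function withRetry<T>(fn: () => Promise<T>, { attempts, baseDelayMs }: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 1x, 2x, 4x... the base delay; skip the wait after the final attempt.
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

A tool handler could then wrap its API call, for example withRetry(() => fetchBalance(username), { attempts: 3, baseDelayMs: 250 }), where fetchBalance stands in for whatever client the real payroll system provides.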

David Hahn | Applied AI Engineering

Real implementations, real bugs, and the mental models behind building on LLMs. I'm a fullstack engineer with 10+ years of experience, including 4 years at Apple, now going deep on applied AI: streaming, RAG, tool use, agents, and the full stack of tooling modern AI products are built on.