Evals — Why a Bad Eval Is Worse Than No Eval
The first time I ran evals on my LLM playground, the agent scored lower than basic RAG — 7.5 vs 8.9. The agent is more capable: it has live tool data on top of retrieval, it can answer questions basic RAG can't, and it routes intelligently between data sources. So a lower score looked wrong.
It was. The system wasn't broken. The eval was.
That distinction — system problem vs. eval problem — is the most important thing to understand about building reliable LLM systems. Without it, you end up "fixing" things that aren't broken while the actual issues go unmeasured.
What Evals Actually Are
An eval is a script that runs your AI system against a fixed set of known inputs and scores the outputs. The inputs are test cases. The scores tell you whether the system is getting better or worse as you make changes — prompts, retrieval strategy, model version, chunk size, similarity threshold.
Without evals, every change is a guess. With evals, you can say: "I changed the retrieval threshold from 0.3 to 0.4 and citation quality went from 6.2 to 7.8 on average." That's the difference between iterating on intuition and iterating on evidence.
The eval setup here runs against two systems: 03-rag-basic (fixed retrieval pipeline) and 06-tool-use-rag (agent that decides which tools to call). Ten cases each, scored across three dimensions: accuracy, citation quality, and confidence.
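A minimal sketch of that loop, assuming the system is exposed as an async function from question to answer and the scorer returns a single number out of 10. The names `runEval`, `summarize`, `AnswerFn`, and `ScoreFn`, and the pass threshold of 7, are illustrative assumptions, not taken from the playground code.

```typescript
// Hypothetical shapes for illustration; the real playground code may differ.
type Case = { id: string; question: string };
type AnswerFn = (question: string) => Promise<string>;
type ScoreFn = (question: string, answer: string) => Promise<number>; // out of 10
type Result = { id: string; score: number; latencyMs: number };

// Run every case through the system, score the output, and record latency.
async function runEval(cases: Case[], answer: AnswerFn, score: ScoreFn): Promise<Result[]> {
  const results: Result[] = [];
  for (const c of cases) {
    const start = Date.now();
    const output = await answer(c.question);
    const latencyMs = Date.now() - start;
    const s = await score(c.question, output);
    results.push({ id: c.id, score: s, latencyMs });
  }
  return results;
}

// Collapse per-case results into the numbers you compare across changes.
// passThreshold = 7 is an assumed cutoff, not one stated in the article.
function summarize(results: Result[], passThreshold = 7) {
  const n = results.length;
  if (n === 0) return { avgScore: 0, passRate: 0, avgLatencyMs: 0 };
  const sum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
  return {
    avgScore: sum(results.map((r) => r.score)) / n,
    passRate: results.filter((r) => r.score >= passThreshold).length / n,
    avgLatencyMs: sum(results.map((r) => r.latencyMs)) / n,
  };
}
```

Because the case list is fixed, two runs of `summarize` before and after a change are directly comparable, which is the whole point of the exercise.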
The Test Case Structure
Each test case captures what the system should do, not just what it should say:
type EvalCase = {
  id: string;
  question: string;
  expected_topics: string[]; // what the answer should cover
  should_use_tool?: string; // which tool should fire, if any
  case_type: "retrieval" | "live_data";
};
The case_type field turned out to be the most important one — and it wasn't in the original design. More on that shortly.
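To make the shape concrete, here is a sketch of two cases, one per `case_type`, plus a small sanity check over the suite. The PTO question is one of the actual cases from the first run; the retrieval question, the `rag-01` id, the tool name `get_pto_balance`, and the `lintCases` helper are hypothetical.

```typescript
type EvalCase = {
  id: string;
  question: string;
  expected_topics: string[]; // what the answer should cover
  should_use_tool?: string; // which tool should fire, if any
  case_type: "retrieval" | "live_data";
};

const cases: EvalCase[] = [
  {
    id: "rag-01", // hypothetical id
    question: "What is the parental leave policy?", // hypothetical question
    expected_topics: ["parental leave", "eligibility"],
    case_type: "retrieval",
  },
  {
    id: "agent-03",
    question: "How much PTO does alice.bob have remaining?",
    expected_topics: ["PTO balance"],
    should_use_tool: "get_pto_balance", // hypothetical tool name
    case_type: "live_data",
  },
];

// Catch malformed cases before they produce misleading scores.
function lintCases(cs: EvalCase[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of cs) {
    if (seen.has(c.id)) problems.push(`${c.id}: duplicate id`);
    seen.add(c.id);
    if (c.expected_topics.length === 0) problems.push(`${c.id}: no expected_topics`);
    if (c.case_type === "live_data" && !c.should_use_tool) {
      problems.push(`${c.id}: live_data case without should_use_tool`);
    }
  }
  return problems;
}
```

The rule that every `live_data` case names a tool is an assumption of this sketch; a suite could reasonably allow live-data cases where any of several tools is acceptable.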
The Scorer: LLM-as-Judge at the Eval Layer
Scoring each response manually would take hours across a full test suite. The solution is the same pattern from 04-structured-output: force the model to return a structured score using tool_choice.
type EvalScore = {
  accuracy: { score: number; reasoning: string }; // score out of 10
  citation_quality: { score: number; reasoning: string };
  confidence: { score: number; reasoning: string };
  verdict: "pass" | "fail";
  overall_score: number; // out of 10
  flags: string[]; // e.g. missing_citation, insufficient_sources
};
LLM-as-judge isn't perfect, but it's scalable. Twenty cases run in a few minutes. The reasoning field in each dimension is as important as the score — it's what tells you whether the model is applying the criteria correctly or drifting.
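A sketch of what forcing that structure can look like. It assumes a request in the shape used by Anthropic's Messages API (`tools` with an `input_schema`, and `tool_choice: { type: "tool", name }`); the article does not say which provider is used, so treat the field layout as an assumption. The tool name `record_score`, the 0 to 10 bounds, and the `isEvalScore` guard are all made up for illustration. Model, messages, and the actual API call are omitted.

```typescript
type Dim = { score: number; reasoning: string };
type EvalScore = {
  accuracy: Dim;
  citation_quality: Dim;
  confidence: Dim;
  verdict: "pass" | "fail";
  overall_score: number;
  flags: string[];
};

// JSON Schema for one scored dimension (0-10 range is an assumption).
const dimSchema = {
  type: "object",
  properties: {
    score: { type: "number", minimum: 0, maximum: 10 },
    reasoning: { type: "string" },
  },
  required: ["score", "reasoning"],
} as const;

// Tool definition plus a forced tool choice, so the reply is a tool call
// rather than free text. "record_score" is a hypothetical name.
const scorerToolConfig = {
  tools: [
    {
      name: "record_score",
      description: "Record the evaluation score for one response.",
      input_schema: {
        type: "object",
        properties: {
          accuracy: dimSchema,
          citation_quality: dimSchema,
          confidence: dimSchema,
          verdict: { type: "string", enum: ["pass", "fail"] },
          overall_score: { type: "number", minimum: 0, maximum: 10 },
          flags: { type: "array", items: { type: "string" } },
        },
        required: ["accuracy", "citation_quality", "confidence", "verdict", "overall_score", "flags"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "record_score" },
} as const;

// The tool input may still deviate from the schema, so validate at runtime.
function isEvalScore(x: unknown): x is EvalScore {
  if (typeof x !== "object" || x === null) return false;
  const o = x as Record<string, unknown>;
  const isDim = (d: unknown): boolean => {
    if (typeof d !== "object" || d === null) return false;
    const v = d as Record<string, unknown>;
    const s = v["score"];
    return typeof s === "number" && s >= 0 && s <= 10 && typeof v["reasoning"] === "string";
  };
  const flags = o["flags"];
  return (
    isDim(o["accuracy"]) && isDim(o["citation_quality"]) && isDim(o["confidence"]) &&
    (o["verdict"] === "pass" || o["verdict"] === "fail") &&
    typeof o["overall_score"] === "number" &&
    Array.isArray(flags) && flags.every((f) => typeof f === "string")
  );
}
```

Keeping `reasoning` required on every dimension is what makes the later debugging possible: a score with no stated reason cannot be audited.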
The False Negative Problem
First run results before any fixes:
agent-03: How much PTO does alice.bob have remaining? 3/10
agent-06: What is alice.bob's role and department? 1.5/10
agent-09: How much PTO does alice.bob get given his tenure? 4/10
All three failures were employee lookups. The agent called the right tools, got the correct data, and returned accurate answers. The scorer flagged them for missing_citation and insufficient_sources — because it was looking for document-style citations, and these responses came from a mock HR API, not retrieved documents.
The system was behaving correctly. The eval was scoring it wrong.
This is a false negative: a working system failing on a bad criterion. It's more dangerous than a false positive because it actively misdirects you. If I'd trusted those numbers and started investigating why the agent's retrieval was weak, I would have been debugging a problem that didn't exist while the actual citation gap in retrieval responses went unaddressed.
The Fix: Case-Aware Scoring
The solution was passing case_type into the scorer so it could apply different citation criteria depending on the data source:
const citationCriteria = caseType === "live_data"
  ? "citation_quality: Does the answer accurately reflect the tool data? Explicit document citations are NOT expected for live data responses."
  : "citation_quality: Does the answer explicitly cite sources inline (e.g. [1], [2])?";
After the fix, both systems hit a 100% pass rate. The agent's overall delta narrowed from -1.4 (7.5 vs 8.9) to -0.8 (8.3 vs 9.1): a real gap, but a much smaller and correctly measured one.
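One way the case-aware criterion might be wired into the scorer prompt is sketched below. The two citation strings are the ones from the fix; the function names and the wording of the accuracy and confidence lines are hypothetical, since the actual scorer prompt is not shown here.

```typescript
type CaseType = "retrieval" | "live_data";

// Citation criterion depends on where the answer's data came from.
function citationCriteria(caseType: CaseType): string {
  return caseType === "live_data"
    ? "citation_quality: Does the answer accurately reflect the tool data? Explicit document citations are NOT expected for live data responses."
    : "citation_quality: Does the answer explicitly cite sources inline (e.g. [1], [2])?";
}

// Assemble the scorer prompt. Everything except the citation line is
// placeholder wording for this sketch.
function buildScorerPrompt(caseType: CaseType, question: string, answer: string): string {
  return [
    "Score the answer on each criterion from 0 to 10 and explain each score.",
    "accuracy: Is the answer factually correct for the question?",
    citationCriteria(caseType),
    "confidence: Is the stated confidence appropriate for the evidence?",
    "",
    `Case type: ${caseType}`,
    `Question: ${question}`,
    `Answer: ${answer}`,
  ].join("\n");
}
```

Passing the case type explicitly, rather than asking the judge to infer it from the answer, keeps the criterion selection deterministic and visible in the eval code.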
What the Final Numbers Actually Tell You
REPORT: Basic RAG (03)
─────────────────────
Overall score: 9.1/10 Pass rate: 100% Avg latency: 8.6s
Accuracy: 9.3/10
Citation quality: 9.3/10
Confidence: 8.6/10
REPORT: Handbook Agent (06)
────────────────────────────
Overall score: 8.3/10 Pass rate: 100% Avg latency: 12.3s
Accuracy: 9.8/10
Citation quality: 6.4/10
Confidence: 8.7/10
The remaining gap is entirely in citation quality (RAG 9.3, agent 6.4). On the other two dimensions the agent is ahead: confidence is marginally higher (8.7 vs 8.6), and accuracy is clearly higher (9.8 vs 9.3) because the agent can pull live data that retrieval alone can't access. The citation gap is real and fixable with a targeted system prompt addition, the same fix already applied to the RAG system.
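The per-dimension comparison is simple arithmetic on the two reports, sketched here with the reported averages. The helper names are illustrative. As a side observation, an unweighted mean of the three dimensions reproduces both reported overall scores, but that is a consistency check on the numbers, not a statement of how the scorer computes `overall_score`.

```typescript
type Report = { accuracy: number; citation_quality: number; confidence: number };

// Averages copied from the two reports above.
const rag: Report = { accuracy: 9.3, citation_quality: 9.3, confidence: 8.6 };
const agent: Report = { accuracy: 9.8, citation_quality: 6.4, confidence: 8.7 };

// Round to one decimal place, matching the report format.
const round1 = (x: number): number => Math.round(x * 10) / 10;

// Per-dimension difference: candidate minus baseline.
function deltas(candidate: Report, baseline: Report): Report {
  return {
    accuracy: round1(candidate.accuracy - baseline.accuracy),
    citation_quality: round1(candidate.citation_quality - baseline.citation_quality),
    confidence: round1(candidate.confidence - baseline.confidence),
  };
}

// Unweighted mean of the three dimensions.
function meanScore(r: Report): number {
  return round1((r.accuracy + r.citation_quality + r.confidence) / 3);
}

console.log(deltas(agent, rag)); // { accuracy: 0.5, citation_quality: -2.9, confidence: 0.1 }
console.log(meanScore(rag), meanScore(agent)); // 9.1 8.3
```

Seen this way, a single -2.9 on citation quality accounts for the whole -0.8 overall gap, with the other two dimensions pulling slightly in the agent's favor.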
The insufficient_sources flags on two RAG cases are genuine: those questions aren't well covered by the document corpus. That's a knowledge gap in the data, not an eval artifact, and the distinction matters.
The Latency Tradeoff
The agent averages 12.3s against RAG's 8.6s, about 43% slower. The agent makes more API calls: query, tool executions, and final generation, versus just query and generation. In production, the mitigations are caching tool results, parallelizing independent calls where possible, and setting latency budgets per query type. But the cost is real and has to be accounted for explicitly in system design, not discovered after launch.
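Two of those mitigations can be sketched in a few lines. This assumes tool calls are async functions keyed by a string argument; `cached` and `lookupProfileAndPto` are hypothetical helpers, and the cache has no expiry or eviction, which a production version handling live data would need.

```typescript
// Cache tool calls by key. Storing the promise (not the resolved value)
// means concurrent requests for the same key share a single in-flight call.
function cached<T>(fetcher: (key: string) => Promise<T>): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key: string): Promise<T> => {
    let hit = cache.get(key);
    if (hit === undefined) {
      hit = fetcher(key);
      cache.set(key, hit);
    }
    return hit;
  };
}

// Independent tool calls run concurrently instead of one after another,
// so the wait is roughly the slower call rather than the sum of both.
async function lookupProfileAndPto(
  user: string,
  getProfile: (u: string) => Promise<string>,
  getPto: (u: string) => Promise<number>,
): Promise<{ profile: string; ptoDays: number }> {
  const [profile, ptoDays] = await Promise.all([getProfile(user), getPto(user)]);
  return { profile, ptoDays };
}
```

Note that this simple cache also keeps rejected promises, so a failed lookup would be replayed until the entry is cleared; whether that is acceptable depends on how transient the tool's errors are.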
The Broader Principle
The eval design is as important as the system being evaluated. Criteria that don't match the system's actual behavior produce scores that are meaningless at best and actively misleading at worst.
When something fails in an eval, the first question shouldn't be "what's wrong with my system?" It should be "is this a system problem or an eval problem?" Answering that correctly is what separates teams that iterate productively from teams that spin on phantom issues.
The pattern that keeps eval problems from compounding:
Separate test cases by response type — retrieval responses and live-data responses have different quality criteria
Read the reasoning, not just the score — the scorer's reasoning field shows when it's applying criteria incorrectly
Flag patterns, not individual failures — three failures on employee lookups is a signal about the eval design, not three independent system bugs
Distinguish real gaps from eval artifacts — the citation gap in the agent is real; the original 1.5/10 on employee lookups was not
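The "flag patterns" step can be partly automated by grouping failures on case type and flag, as in the sketch below. The sample data is illustrative: the three ids are the failing cases from the first run, but the exact flag on each case is assumed, and the threshold of three is an arbitrary choice.

```typescript
type Failure = {
  id: string;
  case_type: "retrieval" | "live_data";
  flags: string[];
};

// Group failing case ids by "case_type:flag".
function groupFailures(failures: Failure[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const f of failures) {
    for (const flag of f.flags) {
      const key = `${f.case_type}:${flag}`;
      const ids = groups.get(key) ?? [];
      ids.push(f.id);
      groups.set(key, ids);
    }
  }
  return groups;
}

// A group at or above minCount is a prompt to inspect the eval criterion
// itself before touching the system. minCount = 3 is an assumed threshold.
function suspiciousGroups(failures: Failure[], minCount = 3): [string, string[]][] {
  return [...groupFailures(failures)].filter(([, ids]) => ids.length >= minCount);
}

// Illustrative data; per-case flags are assumed, not taken from real output.
const firstRun: Failure[] = [
  { id: "agent-03", case_type: "live_data", flags: ["missing_citation"] },
  { id: "agent-06", case_type: "live_data", flags: ["missing_citation", "insufficient_sources"] },
  { id: "agent-09", case_type: "live_data", flags: ["missing_citation"] },
];

console.log(suspiciousGroups(firstRun));
// [ [ "live_data:missing_citation", [ "agent-03", "agent-06", "agent-09" ] ] ]
```

A cluster like this does not prove the eval is wrong, but it tells you where to read the scorer's reasoning first.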

