Evals — Why a Bad Eval Is Worse Than No Eval

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

The first time I ran evals on my LLM playground, the agent scored lower than basic RAG — 7.5 vs 8.9. The agent is more capable: it has live tool data on top of retrieval, it can answer questions basic RAG can't, and it routes intelligently between data sources. So a lower score looked wrong.

It was. The system wasn't broken. The eval was.

That distinction — system problem vs. eval problem — is the most important thing to understand about building reliable LLM systems. Without it, you end up "fixing" things that aren't broken while the actual issues go unmeasured.

What Evals Actually Are

An eval is a script that runs your AI system against a fixed set of known inputs and scores the outputs. The inputs are test cases. The scores tell you whether the system is getting better or worse as you make changes — prompts, retrieval strategy, model version, chunk size, similarity threshold.

Without evals, every change is a guess. With evals, you can say: "I changed the retrieval threshold from 0.3 to 0.4 and citation quality went from 6.2 to 7.8 on average." That's the difference between iterating on intuition and iterating on evidence.

The eval setup here runs against two systems: 03-rag-basic (fixed retrieval pipeline) and 06-tool-use-rag (agent that decides which tools to call). Ten cases each, scored across three dimensions: accuracy, citation quality, and confidence.
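As a rough illustration of what "a script that runs your system against fixed inputs and scores the outputs" can look like, here is a minimal sketch. The names `runEval`, `average`, `Case`, `Scores`, and `Result` are hypothetical and not taken from the actual playground; the system and the scorer are injected as functions so the same loop could run against either pipeline.

```typescript
// Hypothetical shapes for illustration only.
type Case = { id: string; question: string };
type Scores = { accuracy: number; citation_quality: number; confidence: number };
type Result = { id: string; scores: Scores; latencyMs: number };

// `system` answers a question; `score` grades the answer. Both are injected,
// so one loop can serve a fixed RAG pipeline and a tool-using agent alike.
async function runEval(
  cases: Case[],
  system: (question: string) => Promise<string>,
  score: (c: Case, answer: string) => Promise<Scores>,
): Promise<Result[]> {
  const results: Result[] = [];
  for (const c of cases) {
    const start = Date.now();
    const answer = await system(c.question);
    const latencyMs = Date.now() - start; // time spent in the system, not the scorer
    results.push({ id: c.id, scores: await score(c, answer), latencyMs });
  }
  return results;
}

// Average one dimension across all results, rounded to one decimal place.
function average(results: Result[], dim: keyof Scores): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + r.scores[dim], 0);
  return Math.round((total / results.length) * 10) / 10;
}
```

Running the same cases before and after a change, and comparing the per-dimension averages, is what turns "I think this is better" into a number.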

The Test Case Structure

Each test case captures what the system should do, not just what it should say:

type EvalCase = {
  id: string;
  question: string;
  expected_topics: string[];  // what the answer should cover
  should_use_tool?: string;   // which tool should fire, if any
  case_type: "retrieval" | "live_data";
};

The case_type field turned out to be the most important one — and it wasn't in the original design. More on that shortly.
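To make the shape concrete, here is a sketch of two cases. The `agent-03` question comes from the first-run results later in this post; the `rag-01` case, the `expected_topics` values, and the tool name `get_employee_info` are hypothetical placeholders. The helper assumes that every live-data case should name the tool it expects, which is a convention I'm inventing for the example rather than a rule of the original suite.

```typescript
type EvalCase = {
  id: string;
  question: string;
  expected_topics: string[];
  should_use_tool?: string;
  case_type: "retrieval" | "live_data";
};

const cases: EvalCase[] = [
  {
    id: "rag-01", // hypothetical id and question
    question: "How many PTO days do new employees accrue per year?",
    expected_topics: ["PTO accrual"],
    case_type: "retrieval",
  },
  {
    id: "agent-03",
    question: "How much PTO does alice.bob have remaining?",
    expected_topics: ["remaining PTO balance"], // hypothetical topic label
    should_use_tool: "get_employee_info", // hypothetical tool name
    case_type: "live_data",
  },
];

// Sanity check (assumed convention): a live_data case without an expected
// tool is probably mislabeled or incomplete, so return its id.
function findCasesMissingTool(all: EvalCase[]): string[] {
  return all
    .filter((c) => c.case_type === "live_data" && !c.should_use_tool)
    .map((c) => c.id);
}
```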

The Scorer: LLM-as-Judge at the Eval Layer

Scoring each response manually would take hours across a full test suite. The solution is the same pattern from 04-structured-output: force the model to return a structured score using tool_choice.

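type EvalScore = {
  accuracy: { score: number; reasoning: string };         // score out of 10
  citation_quality: { score: number; reasoning: string }; // score out of 10
  confidence: { score: number; reasoning: string };       // score out of 10
  verdict: "pass" | "fail";
  overall_score: number;                                  // out of 10
  flags: string[];  // e.g. "missing_citation", "insufficient_sources"
};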

LLM-as-judge isn't perfect, but it's scalable. Twenty cases run in a few minutes. The reasoning field in each dimension is as important as the score — it's what tells you whether the model is applying the criteria correctly or drifting.
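Here is a hedged sketch of what forcing a structured score with tool_choice can look like. It only builds a request body and does not call any API. The shape assumes an Anthropic-style Messages API (`tools` with `input_schema`, and `tool_choice` of type `"tool"`); this post does not say which provider is in use, so treat that as an assumption. The tool name `record_eval_score`, the `max_tokens` value, and the prompt wording are all hypothetical.

```typescript
// JSON Schema for one scored dimension, mirroring the EvalScore type.
const dimension = {
  type: "object",
  properties: {
    score: { type: "number", description: "Score out of 10" },
    reasoning: { type: "string" },
  },
  required: ["score", "reasoning"],
};

const SCORE_TOOL = "record_eval_score"; // hypothetical tool name

// Builds the judge request; `model` is passed in rather than hardcoded.
function buildJudgeRequest(model: string, question: string, answer: string) {
  return {
    model,
    max_tokens: 1024, // arbitrary budget for the example
    tools: [
      {
        name: SCORE_TOOL,
        description: "Record the evaluation score for one answer.",
        input_schema: {
          type: "object",
          properties: {
            accuracy: dimension,
            citation_quality: dimension,
            confidence: dimension,
            verdict: { type: "string", enum: ["pass", "fail"] },
            overall_score: { type: "number" },
            flags: { type: "array", items: { type: "string" } },
          },
          required: [
            "accuracy", "citation_quality", "confidence",
            "verdict", "overall_score", "flags",
          ],
        },
      },
    ],
    // Forces the model to call the scoring tool instead of replying in prose.
    tool_choice: { type: "tool", name: SCORE_TOOL },
    messages: [
      {
        role: "user",
        content: `Question:\n${question}\n\nAnswer to grade:\n${answer}`,
      },
    ],
  };
}
```

Because the score arrives as tool input rather than free text, the reasoning fields come back in a predictable place, which makes them easy to log and read alongside the numbers.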

The False Negative Problem

First run results before any fixes:

agent-03: How much PTO does alice.bob have remaining?       3/10
agent-06: What is alice.bob's role and department?          1.5/10
agent-09: How much PTO does alice.bob get given his tenure? 4/10

All three failures were employee lookups. The agent called the right tools, got the correct data, and returned accurate answers. The scorer flagged them for missing_citation and insufficient_sources — because it was looking for document-style citations, and these responses came from a mock HR API, not retrieved documents.

The system was behaving correctly. The eval was scoring it wrong.

This is a false negative: a working system failing on a bad criterion. It's more dangerous than a false positive because it actively misdirects you. If I'd trusted those numbers and started investigating why the agent's retrieval was weak, I would have been debugging a problem that didn't exist while the actual citation gap in retrieval responses went unaddressed.

The Fix: Case-Aware Scoring

The solution was passing case_type into the scorer so it could apply different citation criteria depending on the data source:

const citationCriteria = caseType === "live_data"
  ? "citation_quality: Does the answer accurately reflect the tool data? Explicit document citations are NOT expected for live data responses."
  : "citation_quality: Does the answer explicitly cite sources inline (e.g. [1], [2])?";

After the fix, both systems hit 100% pass rate. The agent's overall delta dropped from -1.5 to -0.8 — a real gap, but a much smaller and correctly measured one.
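One possible way to keep this fix from regressing, sketched under my own assumptions: wrap the two criteria strings in a function with an exhaustive switch, so that adding a third case type later fails to type-check until someone decides what citation quality means for it. The function name `citationCriteriaFor` and the `CaseType` alias are hypothetical; the two criteria strings are the ones shown above.

```typescript
type CaseType = "retrieval" | "live_data";

function citationCriteriaFor(caseType: CaseType): string {
  switch (caseType) {
    case "live_data":
      return "citation_quality: Does the answer accurately reflect the tool data? Explicit document citations are NOT expected for live data responses.";
    case "retrieval":
      return "citation_quality: Does the answer explicitly cite sources inline (e.g. [1], [2])?";
    default: {
      // Compile-time guard: a new CaseType member without its own branch
      // makes this assignment a type error.
      const unreachable: never = caseType;
      throw new Error(`No citation criteria for case type: ${String(unreachable)}`);
    }
  }
}
```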

What the Final Numbers Actually Tell You

REPORT: Basic RAG (03)
─────────────────────
Overall score:    9.1/10   Pass rate: 100%   Avg latency: 8.6s
Accuracy:         9.3/10
Citation quality: 9.3/10
Confidence:       8.6/10

REPORT: Handbook Agent (06)
────────────────────────────
Overall score:    8.3/10   Pass rate: 100%   Avg latency: 12.3s
Accuracy:         9.8/10
Citation quality: 6.4/10
Confidence:       8.7/10

The remaining gap is entirely in citation quality (9.3 for basic RAG vs 6.4 for the agent). On every other dimension the agent is level or slightly ahead: confidence is essentially tied (8.6 vs 8.7), and accuracy is higher for the agent (9.8 vs 9.3 for basic RAG) because it can pull live data that retrieval alone can't access. The citation gap is real and fixable with a targeted system prompt addition, the same fix already applied to the RAG system.
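The per-dimension arithmetic is easy to check directly. The sketch below uses only the numbers from the two reports above; the `Report` type and helper names are hypothetical. One side observation: the unweighted mean of the three dimension averages happens to round to the reported overall scores, but this post does not state how overall_score is actually computed, so treat that as a coincidence check rather than the formula.

```typescript
type Report = {
  overall: number;
  accuracy: number;
  citation_quality: number;
  confidence: number;
};

// Numbers copied from the two reports.
const rag: Report = { overall: 9.1, accuracy: 9.3, citation_quality: 9.3, confidence: 8.6 };
const agent: Report = { overall: 8.3, accuracy: 9.8, citation_quality: 6.4, confidence: 8.7 };

// Round to one decimal place to hide floating-point noise.
const round1 = (n: number): number => Math.round(n * 10) / 10;

// a minus b for every field; negative means `a` scored lower.
function deltas(a: Report, b: Report): Report {
  return {
    overall: round1(a.overall - b.overall),
    accuracy: round1(a.accuracy - b.accuracy),
    citation_quality: round1(a.citation_quality - b.citation_quality),
    confidence: round1(a.confidence - b.confidence),
  };
}

// Unweighted mean of the three dimension averages.
function meanOfDimensions(r: Report): number {
  return round1((r.accuracy + r.citation_quality + r.confidence) / 3);
}

const agentVsRag = deltas(agent, rag);
// { overall: -0.8, accuracy: 0.5, citation_quality: -2.9, confidence: 0.1 }
```

Laid out this way, the -0.8 overall delta is visibly driven by a single dimension: citation quality is down 2.9 while accuracy and confidence are both slightly up.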

The insufficient_sources flags on two RAG cases are genuine: those questions aren't well covered by the document corpus. That's a knowledge gap in the data, not an eval artifact, and the distinction matters because a data gap is fixed in the corpus while an eval artifact is fixed in the scorer.

The Latency Tradeoff

The agent averages 12.3s against basic RAG's 8.6s. It makes more API calls per question: the initial query, one or more tool executions, and a final generation, versus just a query and a generation for basic RAG. In production, the mitigations are caching tool results, parallelizing independent calls where possible, and setting latency budgets per query type. But the cost is real and has to be explicitly accounted for in system design, not discovered after launch.
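Two of those mitigations, caching tool results and parallelizing independent calls, can be sketched in a few lines. This is an illustration under assumptions, not the playground's code: `ToolCall`, `ToolRunner`, `withCache`, and `runParallel` are hypothetical names, the cache is in-memory only, and the cache key depends on argument order within `args`.

```typescript
type ToolCall = { tool: string; args: Record<string, string> };
type ToolRunner = (call: ToolCall) => Promise<string>;

// Wraps a tool runner with an in-memory TTL cache keyed by tool name + args.
// `now` is injectable so expiry can be tested without waiting.
function withCache(
  run: ToolRunner,
  ttlMs: number,
  now: () => number = Date.now,
): ToolRunner {
  const cache = new Map<string, { value: string; expires: number }>();
  return async (call) => {
    const key = `${call.tool}:${JSON.stringify(call.args)}`;
    const hit = cache.get(key);
    if (hit && hit.expires > now()) return hit.value; // fresh entry: skip the call
    const value = await run(call);
    cache.set(key, { value, expires: now() + ttlMs });
    return value;
  };
}

// Runs independent tool calls concurrently instead of one after another.
// Results come back in the same order as `calls`.
function runParallel(run: ToolRunner, calls: ToolCall[]): Promise<string[]> {
  return Promise.all(calls.map((c) => run(c)));
}
```

Note that this cache does not deduplicate identical calls that are in flight at the same time, and live data such as a PTO balance needs a short TTL or the cache will serve stale answers. Both are design choices to make per tool rather than globally.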

The Broader Principle

The eval design is as important as the system being evaluated. Criteria that don't match the system's actual behavior produce scores that are meaningless at best and actively misleading at worst.

When something fails in an eval, the first question shouldn't be "what's wrong with my system?" It should be "is this a system problem or an eval problem?" Answering that correctly is what separates teams that iterate productively from teams that spin on phantom issues.

The pattern that keeps eval problems from compounding:

  • Separate test cases by response type — retrieval responses and live-data responses have different quality criteria

  • Read the reasoning, not just the score — the scorer's reasoning field shows when it's applying criteria incorrectly

  • Flag patterns, not individual failures — three failures on employee lookups is a signal about the eval design, not three independent system bugs

  • Distinguish real gaps from eval artifacts — the citation gap in the agent is real; the original 1.5/10 on employee lookups was not
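The "flag patterns, not individual failures" step can be partly automated. A minimal sketch, assuming each result carries its case_type, verdict, and flags: group failing case ids by case type and flag, so that a cluster stands out before anyone starts debugging cases one at a time. `CaseResult` and `failurePatterns` are hypothetical names.

```typescript
type CaseResult = {
  id: string;
  case_type: "retrieval" | "live_data";
  verdict: "pass" | "fail";
  flags: string[];
};

// Groups failing case ids under a "case_type:flag" key.
function failurePatterns(results: CaseResult[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const r of results) {
    if (r.verdict !== "fail") continue; // only failures count toward a pattern
    for (const flag of r.flags) {
      const key = `${r.case_type}:${flag}`;
      groups.set(key, [...(groups.get(key) ?? []), r.id]);
    }
  }
  return groups;
}

// Illustrative data: the ids match the first-run failures, but which flag
// landed on which case is assumed for the example.
const firstRun: CaseResult[] = [
  { id: "agent-03", case_type: "live_data", verdict: "fail", flags: ["missing_citation"] },
  { id: "agent-06", case_type: "live_data", verdict: "fail", flags: ["missing_citation", "insufficient_sources"] },
  { id: "agent-09", case_type: "live_data", verdict: "fail", flags: ["missing_citation"] },
  { id: "agent-01", case_type: "retrieval", verdict: "pass", flags: [] },
];

const patterns = failurePatterns(firstRun);
```

Here all three failures land under `live_data:missing_citation`. When every failure shares a case type and a flag, the scorer's criteria deserve a look before the system does.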

David Hahn | Applied AI Engineering
