The LLM-as-Judge Problem — Making Automated Evaluation Reliable

Automated evaluation using an LLM sounds like an elegant solution until you understand its failure modes. A model playing the role of a teacher grading work has four well-documented ways to get it wrong, and production systems that ignore them produce grades that are either systematically lenient or inconsistent from one run to the next.

This post documents what I learned building the grading component of my personal study system, where I had to solve this problem for real.

The Four Failure Modes

Position bias. If you ask a judge to compare two answers (A vs. B), it tends to favor whichever comes first. Swap the order and you can get the opposite verdict. For pairwise evaluation, always test with swapped order and check consistency.

Verbosity bias. Longer, more confident-sounding answers score higher even when they're less accurate. The judge rewards the appearance of thoroughness. This is particularly dangerous in code review — a long but wrong implementation can outscore a short but correct one.

Self-preference bias. A model tends to rate outputs from itself (or similar models) more favorably than outputs from other models. If the same model generates the problem and grades the solution, this bias is active.

Sycophancy. If the prompt gives any hint of what answer you want, the model leans that way. "Confirm this is correct" versus "evaluate this objectively" produces meaningfully different results even when grading identical content.
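The swapped-order check for position bias can be sketched roughly as follows. This is a minimal illustration, not the code from my system: the judge callable, its "first"/"second" return values, and the two stand-in judges are all assumptions made for the example, and no model is actually called.

```python
from typing import Callable

# Hypothetical judge signature: takes (first_answer, second_answer) and
# returns "first" or "second". In practice this would wrap an LLM call.
Judge = Callable[[str, str], str]


def pairwise_with_swap(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "inconsistent" after judging in both orders."""
    forward = judge(answer_a, answer_b)   # A is shown first
    backward = judge(answer_b, answer_a)  # B is shown first

    # Map the positional verdicts back to the underlying answers.
    forward_winner = "A" if forward == "first" else "B"
    backward_winner = "B" if backward == "first" else "A"

    if forward_winner == backward_winner:
        return forward_winner
    return "inconsistent"


# Stand-in judges for demonstration only.
def always_first(first: str, second: str) -> str:
    """A maximally position-biased judge: always picks the first answer."""
    return "first"


def prefers_shorter(first: str, second: str) -> str:
    """A judge whose verdict depends on content, not on position."""
    return "first" if len(first) <= len(second) else "second"
```

A judge that only ever picks the first slot comes back as "inconsistent", while a judge that responds to the content gives the same winner in both orders. An "inconsistent" result is the signal to distrust that comparison rather than pick a side.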

The Design Decisions I Made

Rubric Decomposition Over Holistic Judgment

Instead of asking "is this a good solution?", every criterion is an atomic yes/no check with a specific description:

{
  "label": "Edge cases handled",
  "points": 2,
  "description": "Handles empty input, null/None, zero, or boundary values without crashing or returning wrong output.",
  "evaluation_type": "cascading",
  "evaluation_dependency": "correct_output"
}

Two things to note here. First, the description gives the model a concrete test to apply, not a judgment call to make. Second, the evaluation_type and evaluation_dependency fields encode a relationship the grader was missing: edge case handling is meaningless to evaluate if the primary output is wrong. Adding cascading dependencies prevents the judge from awarding points for edge cases in a solution that doesn't produce correct output for the happy path.

This dependency modeling took several iterations to get right. Initially I omitted it, and the grader was too lenient on incorrect solutions because it evaluated each criterion independently.
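One way the cascading rule could be applied after the judge returns its per-criterion verdicts is sketched below. It reuses the evaluation_type and evaluation_dependency fields from the rubric above; the "id" key, the "independent" type value, and the correct_output criterion's label and points are assumptions I'm adding for the example.

```python
from typing import Dict, List


def apply_cascading(criteria: List[dict], satisfied: Dict[str, bool]) -> Dict[str, int]:
    """Award points per criterion, zeroing any criterion whose dependency failed.

    `criteria` uses the rubric fields shown above plus a hypothetical "id" key.
    `satisfied` maps criterion id -> the judge's raw yes/no verdict.
    """
    by_id = {c["id"]: c for c in criteria}

    def passes(cid: str, seen: frozenset = frozenset()) -> bool:
        # Unknown ids and dependency cycles are treated as failures.
        if cid not in by_id or cid in seen:
            return False
        if not satisfied.get(cid, False):
            return False
        crit = by_id[cid]
        if crit.get("evaluation_type") == "cascading":
            dep = crit.get("evaluation_dependency")
            if dep is not None and not passes(dep, seen | {cid}):
                return False
        return True

    return {c["id"]: (c["points"] if passes(c["id"]) else 0) for c in criteria}


# Hypothetical two-criterion rubric for illustration.
rubric = [
    {"id": "correct_output", "label": "Correct output", "points": 3,
     "evaluation_type": "independent"},
    {"id": "edge_cases", "label": "Edge cases handled", "points": 2,
     "evaluation_type": "cascading", "evaluation_dependency": "correct_output"},
]
```

With this rubric, a verdict of "edge cases satisfied, correct output not satisfied" awards zero points for both criteria, which is the behavior the independent evaluation was getting wrong.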

Two-Pass Evaluation With Framing

To combat sycophancy and improve consistency, I run the grading prompt twice with different framings:

Strengths framing: "Focus on what the learner is doing well"
Gaps framing: "Focus on what the learner is missing or could fail on"

The gaps framing is dominant: it's the one that drives the final score. If the scores from the two passes differ by more than 5% of the total point value, the result is flagged as uncertain. The threshold is arbitrary but functions as an early signal that the prompt needs tuning, since high discrepancy usually means the criteria descriptions are ambiguous.

The tradeoff I accepted: this approach isn't maximally accurate, but it errs in the direction I want. For a personal learning tool, being harder on gaps than strengths is a reasonable bias. I want the system to catch things I missed, not validate things I did right.
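The combination step is small enough to sketch in full. The function and field names here are illustrative assumptions; the logic is just what's described above: the gaps score wins, and a gap of more than 5% of the total points between the passes raises the uncertainty flag.

```python
from dataclasses import dataclass


@dataclass
class TwoPassResult:
    final_score: float  # the gaps-framing score, which is dominant
    discrepancy: float  # absolute difference between the two passes
    uncertain: bool     # True when the passes disagree by more than the threshold


def combine_passes(strengths_score: float, gaps_score: float,
                   total_points: float, threshold: float = 0.05) -> TwoPassResult:
    """Combine the two framed passes into one result.

    `threshold` is a fraction of `total_points`; 0.05 matches the 5% rule above.
    """
    if total_points <= 0:
        raise ValueError("total_points must be positive")
    discrepancy = abs(strengths_score - gaps_score)
    return TwoPassResult(
        final_score=gaps_score,
        discrepancy=discrepancy,
        uncertain=discrepancy > threshold * total_points,
    )
```

For a 20-point rubric the flag trips when the passes differ by more than one point, so a 15 versus 13 split is marked uncertain and a 15 versus 14.5 split is not.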

Criterion Descriptions Take Most of the Work

The most time-consuming part of building this wasn't the code. It was iterating on the criterion descriptions until the grader produced sensible results. One specific example: the time complexity criterion was initially docking points for an O(2n) solution, claiming a more efficient approach existed, when no such approach was possible. Two fixes resolved this:

Added explicit language to the description: "If a simpler approach has the same Big-O complexity class, prefer that. Only flag if a significantly better class is straightforward to achieve."
Added explicit language about O(n) and O(2n) being the same complexity class.

The underlying issue was the model treating O(n) as a better complexity class than O(2n), hallucinating a faster approach that didn't exist, and penalizing the solution accordingly. Clear criterion descriptions prevent this by constraining the model's judgment to what's actually being measured.

Chain-of-Thought Before the Verdict

The prompt explicitly asks for reasoning before the score:

## Instructions
1. For each rubric criterion, reason through whether the learner's answer satisfies it.
2. Assign points: full points if satisfied, 0 if not.
3. Return your response as a JSON object in exactly this format, with no other text.

Asking the model to reason first measurably reduces snap judgments. The reasoning also surfaces when the model is confused — if the reasoning field contains contradictory logic but is_satisfied is true, that's a signal to revisit the criterion description.
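Turning the judge's reply into a score can be sketched as below. The reasoning and is_satisfied fields are the ones mentioned above; the top-level "criteria" list and the "label" key are assumptions for this example, since the exact response format isn't reproduced in this post.

```python
import json
from typing import Dict


def score_response(raw: str, points_by_label: Dict[str, int]) -> int:
    """Parse the judge's JSON and award full points or 0 per criterion.

    Assumes a hypothetical response shape:
    {"criteria": [{"label": ..., "reasoning": ..., "is_satisfied": ...}, ...]}
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    total = 0
    seen = set()
    for item in data["criteria"]:
        label = item["label"]
        if label not in points_by_label:
            raise ValueError(f"unknown criterion: {label!r}")
        if not isinstance(item["is_satisfied"], bool):
            raise ValueError(f"is_satisfied must be a boolean for {label!r}")
        # A verdict without reasoning defeats the reason-first instruction.
        if not str(item.get("reasoning", "")).strip():
            raise ValueError(f"missing reasoning for {label!r}")
        seen.add(label)
        if item["is_satisfied"]:
            total += points_by_label[label]
    missing = set(points_by_label) - seen
    if missing:
        raise ValueError(f"criteria not graded: {sorted(missing)}")
    return total
```

Rejecting replies with empty reasoning or ungraded criteria keeps a malformed response from silently turning into a score. Spotting contradictory reasoning is still a manual read of the reasoning text, which this sketch doesn't attempt.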

The Broader Point for Production Systems

Every production LLM evaluation system faces this problem in some form. Whether you're evaluating RAG retrieval quality, agent decision quality, or generated content, you need automated scoring you can trust.

The patterns I used here (rubric decomposition, multi-pass with framing, explicit reasoning before verdict, dependency modeling) all generalize. They're not study-buddy-specific. They're the standard toolkit for making LLM-as-judge reliable enough to act on.
