Designing an LLM System That Actually Solves a Real Problem

Most LLM project ideas start from the technology. "What can I build with agents?" or "Let me try a RAG pipeline." That approach produces demos that are interesting for a day and abandoned by the weekend.

This series documents a different starting point: I had a real, recurring problem in my daily workflow, and I decided to build an LLM-powered system to solve it. The problem forced me into every major skill category that matters for applied AI engineering: prompt engineering, eval frameworks, agent architecture, structured output, and spaced repetition scheduling. Not as checkboxes, but as genuine design decisions with real tradeoffs.

The Problem

My daily technical practice sessions were almost entirely manual. Each morning I had to decide what to work on, search through past notes to find problems I'd struggled with, prompt an AI step by step through a structured exercise, grade my own output, and log the session somewhere. The prep took real time, and how well I did it varied from day to day. On high-friction mornings, the prep overhead itself became an excuse to skip.

The other problem was invisible: I had no spaced repetition. Problems I struggled with didn't resurface at the right time. I'd nail something in a session and not see it again for weeks, or hit the same failure mode repeatedly because I had no system tracking it.

What "actually works" looks like: wake up, run one command, get a study plan and a problem to work on. At the end of the session, submit the solution and get a graded report that tells me what I missed. The session is logged automatically, so the system knows what to surface tomorrow.

Why Not Just Use an Existing Tool

LeetCode has no model of how I study. Anki handles repetition but not problem generation or grading. ChatGPT can do pieces of this but has no memory or state across sessions. The combination I needed (personalized problem generation, structured grading against my own rubric, and spaced repetition driven by actual performance data) didn't exist off the shelf. Building it was also the point: every component maps directly to applied AI skills.

The System's Six Jobs

When I broke down what the system actually needs to do, it decomposed into six functional components:

Problem generation: generate a relevant problem based on current skill gaps and history
Grading + feedback: evaluate a submitted solution against both objective and qualitative criteria
Progress tracking: automatically log sessions, scores, and patterns over time
Spaced repetition: resurface problems I struggled with at the right interval
Topic suggestion: recommend what to focus on next based on patterns in what I'm getting wrong
Schedule generation: produce a daily study plan that incorporates all of the above
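To make the progress tracking and spaced repetition jobs concrete, here is a minimal sketch of what a logged session and a resurfacing rule could look like. Everything in it is an assumption for illustration: the SessionRecord fields, the 0.0 to 1.0 score scale, the 0.8 pass threshold, and the double-or-reset interval rule are placeholders, not the schema or scheduling algorithm the finished system uses.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SessionRecord:
    """One logged practice session (hypothetical schema)."""
    problem_id: str
    topic: str
    score: float        # assumed to be normalized to the range 0.0-1.0
    session_date: date
    interval_days: int  # gap that was scheduled before this session


def next_interval_days(record: SessionRecord, pass_score: float = 0.8) -> int:
    """Double the gap after a strong session; reset to one day after a weak one.

    The 0.8 threshold is an arbitrary placeholder, not a tuned value.
    """
    if record.score >= pass_score:
        return max(1, record.interval_days * 2)
    return 1


def next_review_date(record: SessionRecord) -> date:
    """Date on which the problem should resurface."""
    return record.session_date + timedelta(days=next_interval_days(record))


# A problem I struggled with comes back the next day.
struggled = SessionRecord("p-017", "hash maps", 0.4, date(2025, 1, 6), 4)
print(next_review_date(struggled))  # 2025-01-07

# A problem I solved cleanly is pushed further out.
solved = SessionRecord("p-003", "graphs", 0.9, date(2025, 1, 6), 4)
print(next_review_date(solved))  # 2025-01-14
```

The point of the sketch is the coupling: the scheduler has nothing to work with unless every session is logged with a score, which is why tracking has to exist before repetition can.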

The build order matters. Problem generation has the fewest unknowns and can be built with current knowledge. Grading is blocked on a design decision about evaluation reliability. Everything downstream depends on progress tracking. I sequenced the build to unblock design decisions as quickly as possible rather than building in order of appearance.
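One way to make the dependency part of that sequencing explicit is to write the components down as a graph and let a topological sort propose an order. The edges below are my reading of the six jobs, so treat them as assumptions. The component names are hypothetical identifiers, and the sketch does not model the blocking design decision on grading, which sits outside the component graph.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# component -> components it depends on (assumed edges, for illustration only)
DEPENDS_ON = {
    "problem_generation": set(),
    "grading": {"problem_generation"},
    "progress_tracking": {"grading"},
    "spaced_repetition": {"progress_tracking"},
    "topic_suggestion": {"progress_tracking"},
    "schedule_generation": {"spaced_repetition", "topic_suggestion"},
}


def build_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one valid build order in which every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


order = build_order(DEPENDS_ON)
print(order[0])   # problem_generation
print(order[-1])  # schedule_generation
```

Components with no dependency between them, such as spaced repetition and topic suggestion here, can come out in either order. Choosing between them is a judgment call the graph does not make for you.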

The Hardest Design Problem: Who Grades the Grader?

Before writing a line of code, I identified the blocking design decision: if the same model generates a problem AND grades the solution against it, it's evaluating its own output. Lenient generation leads to easy grades. The system becomes self-congratulatory.

This isn't just a personal project concern. It's one of the central problems in production LLM evaluation systems: how do you prevent model-generated rubrics from being too accommodating of model-generated solutions?

I researched three patterns before making a design decision:

Multi-model evaluation: use a different model to grade than the one that generated the problem. Breaks the self-preference loop.
Rubric decomposition: instead of asking "was this good?", break the evaluation into atomic yes/no checks. Harder to be lenient when each criterion has a specific description.
Adversarial test generation: prompt a model specifically to find edge cases the solution might miss, rather than just verifying the happy path.
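As a rough illustration of how the first two patterns could combine, the sketch below grades a solution with one atomic yes/no question per criterion and takes the judge as a parameter, so the grader can be a different model from the generator. All names here (Criterion, Judge, grade, stub_judge) and the two example criteria are hypothetical. The stub judge is a string-matching placeholder so the code runs offline; a real judge would wrap a call to a grader model. This is not the design the series lands on.

```python
from dataclasses import dataclass
from typing import Callable

# A judge answers one yes/no question about one solution. In a real system this
# would call a grader model that differs from the model that wrote the problem.
Judge = Callable[[str, str], bool]


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # phrased so the only valid answers are yes or no


def grade(solution: str, rubric: list[Criterion], judge: Judge) -> dict[str, bool]:
    """Ask one atomic question per criterion and record each answer."""
    return {c.name: judge(c.question, solution) for c in rubric}


def score(results: dict[str, bool]) -> float:
    """Fraction of criteria passed; 0.0 for an empty rubric."""
    return sum(results.values()) / len(results) if results else 0.0


RUBRIC = [
    Criterion("handles_empty_input", "Does the solution handle an empty input list?"),
    Criterion("states_complexity", "Does the solution state its time complexity?"),
]


def stub_judge(question: str, solution: str) -> bool:
    """Offline placeholder: crude string checks instead of a model call."""
    if "empty input" in question:
        return "if not" in solution
    return "O(" in solution


solution = "def total(xs):\n    if not xs:\n        return 0\n    return sum(xs)"
results = grade(solution, RUBRIC, stub_judge)
print(results)         # {'handles_empty_input': True, 'states_complexity': False}
print(score(results))  # 0.5
```

Because each criterion is recorded separately, the report can say exactly which check failed instead of returning a single holistic grade.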

The decision I landed on, and why, is the subject of the next post in this series.

What This Looks Like as an Architecture Decision

The forward deployed engineer (FDE) and applied AI roles I'm targeting care about this kind of thinking. Not "I used the Anthropic API" but "I identified a reliability problem at the system design level, researched the known solutions, and made an explicit decision with documented tradeoffs." That's the work, not the code.

The component build plan, the sequencing logic, and the blocking research question are the artifacts of that thinking. The code comes after.
