Adding Spaced Repetition to an LLM Study System — SM-2, Schema Design, and a Scoring Problem Worth Solving

Senior engineer with deep roots in React, TypeScript, and production-scale UI — including 4 years at Apple. Now focused on applied AI engineering: working hands-on with LLM APIs, RAG pipelines, agents, and the full stack of tooling that modern AI products are built on. Writing here to document the build, share what's non-obvious, and connect with teams working on hard AI problems.

The first three phases of my study system handled the session loop: generate a problem, grade the solution, log the result. Useful, but with a fundamental gap — there was no memory of what I struggled with, and no mechanism to resurface it at the right time.

Phase 4 adds spaced repetition via the SM-2 algorithm. The implementation forced three interesting design decisions: a schema change that clarified a data modeling confusion, a scoring conversion that rejected the obvious naive approach, and a function boundary that made the scheduling logic actually testable.

What SM-2 Does

SM-2 is the algorithm that Anki's classic scheduler is based on. Instead of reviewing problems on a fixed schedule, each problem is scheduled based on how well you did last time. Struggle → see it again soon. Do well → longer gap before it resurfaces.

Each problem tracks three persistent values:

  • interval — days until the next review

  • ease_factor — a multiplier that controls how fast the interval grows (starts at 2.5, floors at 1.3)

  • repetitions — how many consecutive successful reviews have happened

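After each session, a 0–5 quality score drives the update. Scores below 3 reset the repetition count to 0 and the interval to 1 day. Scores of 3 and above advance the schedule: the first two successful reviews use fixed intervals of 1 and 6 days, and every later one multiplies the previous interval by the ease factor. Passing scores also nudge the ease factor up or down depending on how high the score was.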

The result is a scheduling system that adapts to actual performance — problems you consistently get right disappear from the queue for weeks; problems you keep missing come back the next day.

The Schema Problem: Two Tables, Two Concerns

Before Phase 4, the system had one table: sessions. Each row was a logged study session with topic, difficulty, and grading results. Adding SM-2 to that table would mean mixing two fundamentally different concerns.

Sessions are historical records — immutable, append-only, one row per session. SM-2 state is current state — mutable, one row per problem, updated in place after every review. Mixing them would produce a table where some columns are historical facts and others are running totals, with no clean way to query either correctly.

The solution is a problems table that holds one row per generated problem. SM-2 fields (interval, ease_factor, repetitions, next_review_date) start as NULL on first insert and get updated in place after each session. sessions gets a problem_id foreign key and drops the fields now derivable via join.
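A minimal sketch of that split, using SQLite through Python's sqlite3 module. The SM-2 column names and the problem_id foreign key come from the description above; the choice of SQLite and every other column (topic, difficulty, total_score, logged_at) are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE problems (
    id               INTEGER PRIMARY KEY,
    topic            TEXT NOT NULL,   -- assumed column
    difficulty       TEXT NOT NULL,   -- assumed column
    -- SM-2 state: NULL until the first review, then updated in place
    interval         INTEGER,
    ease_factor      REAL,
    repetitions      INTEGER,
    next_review_date TEXT             -- ISO date string
);

CREATE TABLE sessions (
    id          INTEGER PRIMARY KEY,
    problem_id  INTEGER NOT NULL REFERENCES problems(id),
    total_score INTEGER NOT NULL,     -- assumed column
    logged_at   TEXT NOT NULL         -- assumed column
);
"""

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(SCHEMA)

# A freshly generated problem has no scheduling state yet.
conn.execute(
    "INSERT INTO problems (topic, difficulty) VALUES (?, ?)",
    ("recursion", "medium"),
)
fresh = conn.execute("SELECT * FROM problems WHERE id = 1").fetchone()

# Logging a session appends a history row; the SM-2 state is overwritten in place.
conn.execute(
    "INSERT INTO sessions (problem_id, total_score, logged_at) VALUES (?, ?, ?)",
    (1, 7, "2025-01-01"),
)
conn.execute(
    "UPDATE problems SET interval = ?, ease_factor = ?, repetitions = ?, "
    "next_review_date = ? WHERE id = ?",
    (1, 2.6, 1, "2025-01-02", 1),
)
reviewed = conn.execute("SELECT * FROM problems WHERE id = 1").fetchone()
session_count = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
```

The sessions table only ever grows, while problems keeps exactly one row per problem no matter how many reviews happen.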

An earlier draft of the design stored a new SM-2 row after each session to preserve a history of scheduling states. The problem: there's no practical query that needs past SM-2 states. The only thing the scheduler ever needs is the current state: the next review date and the current ease factor. Keeping historical rows would add storage cost and query complexity for zero benefit.

The Scoring Conversion Problem

SM-2 expects a 0–5 quality rating with specific semantics: 0–2 means the review failed, 3–5 means it passed. That threshold is what drives whether the interval resets or grows — so the mapping from session score to SM-2 score has to respect it.

The naive approach is a linear mapping: round(total_score / max_score * 5). The problem is that rubrics with multiple criteria rarely produce scores near 0 even for completely wrong answers. A solution with correct structure but wrong output might still score 40–50% on the rubric because it passes naming, modularity, and style criteria. A linear mapping puts those wrong answers at 2 or 3, right on the pass/fail boundary, so the threshold no longer says anything about whether the output was correct.
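To make the failure mode concrete, here is a small sketch of the linear mapping applied to hypothetical rubric totals. The function name naive_sm2_score and the totals out of 10 are made up for illustration.

```python
def naive_sm2_score(total_score: int, max_score: int) -> int:
    """Linear rubric-to-SM-2 mapping (the rejected approach)."""
    return round(total_score / max_score * 5)


# Hypothetical rubric totals (out of 10) for answers whose output was wrong
# but which still earned points for naming, modularity, and style.
wrong_output_totals = [4, 5, 6]
naive_scores = [naive_sm2_score(t, 10) for t in wrong_output_totals]

# Python's round() uses round-half-to-even, so 2.5 becomes 2.
# The 6/10 answer maps to 3, which SM-2 reads as a passed review.
wrongly_passed = [s for s in naive_scores if s >= 3]
```

With these inputs the wrong answers land on both sides of the threshold, which is exactly the ambiguity the gated approach removes.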

The correct approach uses correct_output as the gate:

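import math

def calculate_initial_sm2_score(
    is_correct: bool, is_uncertain: bool, total_score: int, max_score: int
) -> int:
    # Secondary criteria contribute 0-2 points on their own.
    sm2_score = math.floor(total_score / max_score * 2)
    # Correct output is the gate: it lifts the score into the passing 3-5 band.
    if is_correct:
        sm2_score += 3
    # An unreliable grade costs one point, but the score never drops below 0.
    if is_uncertain and sm2_score > 0:
        sm2_score -= 1
    return sm2_score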

If correct_output failed, the score lands in 0–2 regardless of how the other criteria performed. If it passed, it starts at 3 and scales up based on the remaining criteria. The is_uncertain flag (set when the two grading passes diverge by more than 5%) decrements the score by 1: a signal that the grade itself isn't reliable, which should slow the scheduling down rather than reward it. One consequence is that a correct solution sitting at exactly 3 drops to 2 when flagged, so SM-2 treats it as a failed review and resets it.
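A few worked inputs show how the gate behaves. The function is restated so the snippet runs on its own; the rubric totals are hypothetical.

```python
import math


def calculate_initial_sm2_score(
    is_correct: bool, is_uncertain: bool, total_score: int, max_score: int
) -> int:
    sm2_score = math.floor(total_score / max_score * 2)
    if is_correct:
        sm2_score += 3
    if is_uncertain and sm2_score > 0:
        sm2_score -= 1
    return sm2_score


# (is_correct, is_uncertain, total_score, max_score) for hypothetical sessions
wrong_but_polished = calculate_initial_sm2_score(False, False, 9, 10)   # floor(1.8) = 1
wrong_full_marks = calculate_initial_sm2_score(False, False, 10, 10)    # capped at 2
correct_average = calculate_initial_sm2_score(True, False, 5, 10)       # 1 + 3 = 4
correct_perfect = calculate_initial_sm2_score(True, False, 10, 10)      # 2 + 3 = 5
correct_uncertain = calculate_initial_sm2_score(True, True, 4, 10)      # 3 - 1 = 2
wrong_uncertain = calculate_initial_sm2_score(False, True, 2, 10)       # stays at 0
```

A wrong answer can never reach 3, however polished the rest of the solution is, and an uncertain grade can pull a borderline pass back under the threshold.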

Criteria with evaluation_dependency: "correct_output" — things like edge case handling — are excluded from the secondary score entirely. Including them when correctness failed would double-penalize and distort the 0–2 range.
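One way the exclusion could look, as a sketch. The shape of a criterion record (name, score, max_score, evaluation_dependency) and the helper name secondary_totals are assumptions; only the evaluation_dependency: "correct_output" marker comes from the text above. The sketch also assumes the correct_output criterion itself is left out of the secondary totals, since it acts as the gate.

```python
from typing import Optional, TypedDict


class Criterion(TypedDict):
    # Hypothetical record shape for one graded rubric criterion.
    name: str
    score: int
    max_score: int
    evaluation_dependency: Optional[str]


def secondary_totals(criteria: list[Criterion]) -> tuple[int, int]:
    """Sum score and max_score over criteria independent of correct_output."""
    independent = [
        c
        for c in criteria
        if c["name"] != "correct_output"
        and c["evaluation_dependency"] != "correct_output"
    ]
    total = sum(c["score"] for c in independent)
    maximum = sum(c["max_score"] for c in independent)
    return total, maximum


graded: list[Criterion] = [
    {"name": "correct_output", "score": 0, "max_score": 4, "evaluation_dependency": None},
    {"name": "edge_cases", "score": 0, "max_score": 2, "evaluation_dependency": "correct_output"},
    {"name": "naming", "score": 2, "max_score": 2, "evaluation_dependency": None},
    {"name": "modularity", "score": 1, "max_score": 2, "evaluation_dependency": None},
]
total_score, max_score = secondary_totals(graded)
```

Here the failed edge_cases criterion contributes nothing to either total, so the secondary score reflects only naming and modularity (3 out of 4).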

SM-2 as a Pure Function

The actual SM-2 calculation is implemented as a pure function: takes current state + the 0–5 score, returns updated state, touches no database.

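from dataclasses import dataclass

# Assumed shape of SM2Result; the original definition is not shown.
@dataclass
class SM2Result:
    repetitions: int
    interval: int
    ease_factor: float

def calculate_sm2(
    repetitions: int,
    interval: int,
    ease_factor: float,
    quality: int
) -> SM2Result:
    # Failed review: restart the sequence, leave the ease factor unchanged.
    if quality < 3:
        return SM2Result(repetitions=0, interval=1, ease_factor=ease_factor)

    # Standard SM-2 ease update (includes the quadratic term), floored at 1.3.
    miss = 5 - quality
    new_ease = max(1.3, ease_factor + 0.1 - miss * (0.08 + miss * 0.02))

    # First two successes use fixed intervals; later ones scale by the ease factor.
    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        new_interval = round(interval * ease_factor)

    return SM2Result(
        repetitions=repetitions + 1,
        interval=new_interval,
        ease_factor=new_ease
    )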

The caller (log_session()) handles DB reads and writes. First-review defaults are resolved before the function is called:

ease_factor = problem["ease_factor"] if problem["ease_factor"] is not None else 2.5
repetitions = problem["repetitions"] if problem["repetitions"] is not None else 0
interval = problem["interval"] if problem["interval"] is not None else 0

Keeping defaults out of the pure function means the function is fully testable without a database — pass in any combination of state and score, verify the output.
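A sketch of what that kind of test can look like. The function is restated compactly so the snippet runs on its own, using the ease update from the published SM-2 description; the SM2Result dataclass is an assumed shape, and the review sequence is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SM2Result:  # assumed shape; the original definition is not shown
    repetitions: int
    interval: int
    ease_factor: float


def calculate_sm2(
    repetitions: int, interval: int, ease_factor: float, quality: int
) -> SM2Result:
    if quality < 3:
        return SM2Result(repetitions=0, interval=1, ease_factor=ease_factor)
    miss = 5 - quality
    new_ease = max(1.3, ease_factor + 0.1 - miss * (0.08 + miss * 0.02))
    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        new_interval = round(interval * ease_factor)
    return SM2Result(repetitions + 1, new_interval, new_ease)


# Three perfect reviews from the first-review defaults, then one failure.
state = SM2Result(repetitions=0, interval=0, ease_factor=2.5)
intervals = []
for quality in (5, 5, 5, 2):
    state = calculate_sm2(
        state.repetitions, state.interval, state.ease_factor, quality
    )
    intervals.append(state.interval)
```

No database, fixtures, or mocks are involved: the whole schedule is driven by plain values, and the failure at the end resets the interval to 1 while leaving the accumulated ease factor alone.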

One Date Decision That Matters

An earlier version calculated next_review_date as the old due date plus the new interval. If a problem was due on Monday and reviewed on Friday, the next review would be scheduled from Monday — not Friday.

That's wrong. Spaced repetition should adapt to when the review actually happened, not when it was supposed to happen. Anchoring to the old due date shortens the real gap by however many days the review was late, so a late review would come back sooner than the algorithm intended. next_review_date is always calculated from today.
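The difference between the two anchors, sketched with the Monday/Friday example. The helper names and the specific dates are illustrative, not taken from the project.

```python
from datetime import date, timedelta


def next_review_from_today(interval_days: int, today: date) -> date:
    """Anchor the next review to the day the review actually happened."""
    return today + timedelta(days=interval_days)


def next_review_from_due_date(interval_days: int, previous_due: date) -> date:
    """The rejected approach: anchor to the stale due date."""
    return previous_due + timedelta(days=interval_days)


due = date(2024, 1, 1)        # a Monday
reviewed = date(2024, 1, 5)   # the following Friday, four days late
interval_days = 6

stale = next_review_from_due_date(interval_days, due)
fresh = next_review_from_today(interval_days, reviewed)
actual_gap_stale = (stale - reviewed).days
actual_gap_fresh = (fresh - reviewed).days
```

Anchored to the due date, a 6-day interval leaves only 2 days between the real review and the next one; anchored to the review date, the gap is the full 6 days.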

What's Still Not Wired Up

Phase 4 adds the schema, the scoring conversion, and the SM-2 calculation. Two things are explicitly deferred:

main.py hasn't been updated yet — it still uses the old log_session() signature. End-to-end wiring is blocked on resetting seed data for the new schema, which is a Phase 5 task.

get_due_problems() exists as a function but isn't exposed to the user yet. It returns all problems where next_review_date <= today, and accepts an anchor_date argument so it's testable with arbitrary dates. Surfacing it in the session flow — and deciding whether a session should start with due reviews or new problems — is also Phase 5.
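One possible shape for that query, as a sketch. Only the function name, the next_review_date <= today condition, and the anchor_date argument come from the text; the connection parameter, the SQLite backend, the ordering, and the trimmed-down table are assumptions.

```python
import sqlite3
from datetime import date
from typing import Optional


def get_due_problems(
    conn: sqlite3.Connection, anchor_date: Optional[date] = None
) -> list:
    """Return problems whose next_review_date is on or before the anchor date."""
    anchor = (anchor_date or date.today()).isoformat()
    # ISO date strings sort chronologically, so a plain string comparison works.
    # Never-reviewed problems (NULL next_review_date) are skipped explicitly.
    return conn.execute(
        "SELECT * FROM problems "
        "WHERE next_review_date IS NOT NULL AND next_review_date <= ? "
        "ORDER BY next_review_date",
        (anchor,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE problems (id INTEGER PRIMARY KEY, next_review_date TEXT)"
)
conn.executemany(
    "INSERT INTO problems (id, next_review_date) VALUES (?, ?)",
    [(1, "2025-03-01"), (2, "2025-03-10"), (3, None), (4, "2025-03-05")],
)

due_ids = [row["id"] for row in get_due_problems(conn, date(2025, 3, 5))]
```

Passing anchor_date explicitly keeps the test deterministic: the result depends only on the rows and the date handed in, never on the clock.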

The deferred scope is intentional. Phase 4's job was getting the scheduling logic right. Wiring it into the user flow is the next phase's job.