Adding Spaced Repetition to an LLM Study System — SM-2, Schema Design, and a Scoring Problem Worth Solving

The first three phases of my study system handled the session loop: generate a problem, grade the solution, log the result. Useful, but with a fundamental gap — there was no memory of what I struggled with, and no mechanism to resurface it at the right time.

Phase 4 adds spaced repetition via the SM-2 algorithm. The implementation forced three interesting design decisions: a schema change that clarified a data modeling confusion, a scoring conversion that rejected the obvious naive approach, and a function boundary that made the scheduling logic actually testable.

What SM-2 Does

SM-2 is the algorithm that Anki's classic scheduler is derived from. Instead of reviewing problems on a fixed schedule, it schedules each problem based on how well you did last time. Struggle → see it again soon. Do well → longer gap before it resurfaces.

Each problem tracks three persistent values:

interval — days until the next review
ease_factor — a multiplier that controls how fast the interval grows (starts at 2.5, floors at 1.3)
repetitions — how many consecutive successful reviews have happened

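After each session, a 0–5 quality score drives the update. Scores below 3 reset the repetition count and drop the interval back to 1 day. Scores of 3 and above grow the interval: 1 day after the first successful review, 6 days after the second, and the previous interval multiplied by the ease factor after that.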

The result is a scheduling system that adapts to actual performance — problems you consistently get right disappear from the queue for weeks; problems you keep missing come back the next day.

The Schema Problem: Two Tables, Two Concerns

Before Phase 4, the system had one table: sessions. Each row was a logged study session with topic, difficulty, and grading results. Adding SM-2 to that table would mean mixing two fundamentally different concerns.

Sessions are historical records — immutable, append-only, one row per session. SM-2 state is current state — mutable, one row per problem, updated in place after every review. Mixing them would produce a table where some columns are historical facts and others are running totals, with no clean way to query either correctly.

The solution is a problems table that holds one row per generated problem. SM-2 fields (interval, ease_factor, repetitions, next_review_date) start as NULL on first insert and get updated in place after each session. sessions gets a problem_id foreign key and drops the fields now derivable via join.

An earlier design would have stored a new SM-2 row after each session to preserve a history of scheduling states. The problem: there's no practical query that needs past SM-2 states. The only thing the scheduler ever needs is the current state: what's the next review date, what's the current ease factor. Keeping historical rows would add storage cost and query complexity for zero benefit.
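A minimal sketch of the two-table split, assuming SQLite via Python's sqlite3 module. Only the SM-2 columns and problem_id come from the description above; the remaining column names (topic, difficulty, total_score, max_score, logged_at) are assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: everything except the SM-2 fields and problem_id
# is an assumed column name, not the project's actual definition.
SCHEMA = """
CREATE TABLE problems (
    id               INTEGER PRIMARY KEY,
    topic            TEXT NOT NULL,
    difficulty       TEXT NOT NULL,
    -- SM-2 state: NULL on first insert, updated in place after each review
    interval         INTEGER,
    ease_factor      REAL,
    repetitions      INTEGER,
    next_review_date TEXT
);

CREATE TABLE sessions (
    id          INTEGER PRIMARY KEY,
    problem_id  INTEGER NOT NULL REFERENCES problems(id),
    total_score INTEGER NOT NULL,
    max_score   INTEGER NOT NULL,
    logged_at   TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A new problem starts with no SM-2 state.
conn.execute(
    "INSERT INTO problems (topic, difficulty) VALUES (?, ?)",
    ("recursion", "medium"),
)

# Sessions are append-only history: one new row per study session.
conn.execute(
    "INSERT INTO sessions (problem_id, total_score, max_score, logged_at) "
    "VALUES (?, ?, ?, ?)",
    (1, 7, 10, "2024-01-01"),
)

# SM-2 state is current state: the same problems row is overwritten.
conn.execute(
    "UPDATE problems SET interval = ?, ease_factor = ?, repetitions = ?, "
    "next_review_date = ? WHERE id = ?",
    (1, 2.5, 1, "2024-01-02", 1),
)
```

Topic and difficulty live on problems in this sketch, so a session row reaches them through a join on problem_id rather than storing its own copy.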

The Scoring Conversion Problem

SM-2 expects a 0–5 quality rating with specific semantics: 0–2 means the review failed, 3–5 means it passed. That threshold is what drives whether the interval resets or grows — so the mapping from session score to SM-2 score has to respect it.

The naive approach is a linear mapping: round(total_score / max_score * 5). The problem is that rubrics with multiple criteria rarely produce scores near 0 even for completely wrong answers. A solution with correct structure but wrong output might still score 40–50% on the rubric because it passes naming, modularity, and style criteria. A linear mapping compresses most solutions into the 2–3 range, which straddles the pass/fail threshold, so whether a review counts as passed ends up depending on style points rather than on correctness.
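To make the compression concrete, here is a small sketch of the linear mapping applied to a few rubric totals. The function name naive_sm2_score and the sample totals are made up for illustration.

```python
def naive_sm2_score(total_score: int, max_score: int) -> int:
    """Linear mapping from a rubric total onto the 0-5 SM-2 scale."""
    return round(total_score / max_score * 5)


# Hypothetical rubric totals out of 100. Nothing in the mapping knows
# whether the solution's output was actually correct.
for total in (40, 45, 50, 60):
    print(total, "->", naive_sm2_score(total, 100))
# 40 -> 2
# 45 -> 2
# 50 -> 2
# 60 -> 3
```

A wrong solution that collects 60% on secondary criteria would register as a passed review, while a correct one at 50% would register as failed. Correctness never enters the calculation.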

The correct approach uses correct_output as the gate:

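import math


def calculate_initial_sm2_score(
    is_correct: bool, is_uncertain: bool, total_score: int, max_score: int
) -> int:
    """Map a graded session onto the 0-5 SM-2 quality scale."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    # Secondary criteria contribute 0-2 points.
    sm2_score = math.floor(total_score / max_score * 2)
    # A passing correct_output criterion lifts the score into the 3-5 range.
    if is_correct:
        sm2_score += 3
    # An unreliable grade costs one point, but never drops below 0.
    if is_uncertain and sm2_score > 0:
        sm2_score -= 1
    return sm2_score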

If correct_output failed, the score lands in 0–2 regardless of how the other criteria performed. If it passed, it starts at 3 and scales up based on the remaining criteria. The is_uncertain flag (set when the two grading passes diverge by more than 5%) decrements the score by 1: a signal that the grade itself isn't reliable, which should slow the scheduling down rather than reward it. One consequence is that a correct solution that would score exactly 3 drops to 2 when flagged uncertain, which SM-2 treats as a failed review.

Criteria with evaluation_dependency: "correct_output" — things like edge case handling — are excluded from the secondary score entirely. Including them when correctness failed would double-penalize and distort the 0–2 range.
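The exclusion rule could look something like the sketch below. The Criterion shape and the secondary_totals helper are hypothetical, and the sketch also assumes that the correct_output criterion itself is left out of the secondary score, since it already acts as the gate.

```python
from typing import Optional, TypedDict


class Criterion(TypedDict):
    # Assumed rubric entry shape; the real grader output may differ.
    name: str
    score: int
    max_score: int
    evaluation_dependency: Optional[str]


def secondary_totals(criteria: list[Criterion]) -> tuple[int, int]:
    """Sum the criteria that neither are nor depend on correct_output."""
    kept = [
        c for c in criteria
        if c["name"] != "correct_output"
        and c["evaluation_dependency"] != "correct_output"
    ]
    total = sum(c["score"] for c in kept)
    maximum = sum(c["max_score"] for c in kept)
    return total, maximum


criteria: list[Criterion] = [
    {"name": "correct_output", "score": 0, "max_score": 10, "evaluation_dependency": None},
    {"name": "edge_cases", "score": 0, "max_score": 5, "evaluation_dependency": "correct_output"},
    {"name": "naming", "score": 4, "max_score": 5, "evaluation_dependency": None},
    {"name": "modularity", "score": 3, "max_score": 5, "evaluation_dependency": None},
]

print(secondary_totals(criteria))  # (7, 10)
```

Passing those totals to calculate_initial_sm2_score with is_correct=False gives floor(7 / 10 * 2) = 1, so the failed edge-case criterion does not drag the score down a second time.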

SM-2 as a Pure Function

The actual SM-2 calculation is implemented as a pure function: it takes the current state plus the 0–5 score, returns the updated state, and touches no database.

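from dataclasses import dataclass


# Assumed shape of SM2Result; its definition is not shown in the original.
@dataclass(frozen=True)
class SM2Result:
    repetitions: int
    interval: int
    ease_factor: float


def calculate_sm2(
    repetitions: int,
    interval: int,
    ease_factor: float,
    quality: int
) -> SM2Result:
    # Failed review: restart the repetition count, keep the ease factor.
    if quality < 3:
        return SM2Result(repetitions=0, interval=1, ease_factor=ease_factor)

    # Standard SM-2 ease update (includes the quadratic term), floored at 1.3.
    miss = 5 - quality
    new_ease = max(1.3, ease_factor + 0.1 - miss * (0.08 + miss * 0.02))

    # The first two passes use fixed intervals; later ones scale the
    # previous interval by the ease factor held before this review.
    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        new_interval = round(interval * ease_factor)

    return SM2Result(
        repetitions=repetitions + 1,
        interval=new_interval,
        ease_factor=new_ease
    )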

The caller (log_session()) handles DB reads and writes. First-review defaults are resolved before the function is called:

ease_factor = problem["ease_factor"] if problem["ease_factor"] is not None else 2.5
repetitions = problem["repetitions"] if problem["repetitions"] is not None else 0
interval = problem["interval"] if problem["interval"] is not None else 0

Keeping both the database access and the NULL handling in the caller means the function is fully testable without a database: pass in any combination of state and score, verify the output.
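As a rough illustration of what that testability buys, the sketch below restates calculate_sm2 so the snippet runs on its own (with the ease update written in the standard SM-2 form) and adds three plain-assert tests. The SM2Result dataclass and the test names are assumptions, not the project's actual test suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SM2Result:
    # Assumed result shape, restated so the snippet is self-contained.
    repetitions: int
    interval: int
    ease_factor: float


def calculate_sm2(repetitions: int, interval: int,
                  ease_factor: float, quality: int) -> SM2Result:
    if quality < 3:
        return SM2Result(repetitions=0, interval=1, ease_factor=ease_factor)
    miss = 5 - quality
    new_ease = max(1.3, ease_factor + 0.1 - miss * (0.08 + miss * 0.02))
    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        new_interval = round(interval * ease_factor)
    return SM2Result(repetitions + 1, new_interval, new_ease)


def test_failure_resets_progress() -> None:
    # A score below 3 restarts the schedule but leaves the ease factor alone.
    result = calculate_sm2(repetitions=4, interval=30, ease_factor=2.2, quality=2)
    assert result == SM2Result(repetitions=0, interval=1, ease_factor=2.2)


def test_first_two_passes_use_fixed_intervals() -> None:
    first = calculate_sm2(0, 0, 2.5, 5)
    second = calculate_sm2(first.repetitions, first.interval, first.ease_factor, 5)
    assert (first.interval, second.interval) == (1, 6)


def test_ease_factor_never_drops_below_floor() -> None:
    # A bare pass (quality 3) lowers the ease factor, but not past 1.3.
    result = calculate_sm2(repetitions=3, interval=10, ease_factor=1.3, quality=3)
    assert result.ease_factor == 1.3
```

No fixtures, no seed data, no connection setup: each test is a single call with literal arguments, runnable under pytest or by calling the functions directly.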

One Date Decision That Matters

An earlier version calculated next_review_date as the old due date plus the new interval. If a problem was due on Monday and reviewed on Friday, the next review would be scheduled from Monday — not Friday.

That's wrong. Spaced repetition should adapt to when the review actually happened, not when it was supposed to happen. Late reviews shouldn't cascade into further compressed scheduling. next_review_date is always calculated from today.
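The difference is easy to see with concrete dates. This is a minimal sketch; the helper name next_review_date and the sample dates are chosen for illustration.

```python
from datetime import date, timedelta


def next_review_date(reviewed_on: date, interval_days: int) -> date:
    """Schedule the next review relative to the day the review happened."""
    return reviewed_on + timedelta(days=interval_days)


due = date(2024, 1, 1)       # Monday: when the problem was due
reviewed = date(2024, 1, 5)  # Friday: when it was actually reviewed
interval = 6                 # new interval returned by SM-2

anchored_on_due = due + timedelta(days=interval)        # 2024-01-07
anchored_on_review = next_review_date(reviewed, interval)  # 2024-01-11

print((anchored_on_due - reviewed).days)     # 2
print((anchored_on_review - reviewed).days)  # 6
```

Anchoring on the old due date would leave a two-day gap after a review that earned a six-day interval. Anchoring on the review date gives the full six days.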

What's Still Not Wired Up

Phase 4 adds the schema, the scoring conversion, and the SM-2 calculation. Two things are explicitly deferred:

main.py hasn't been updated yet — it still uses the old log_session() signature. End-to-end wiring is blocked on resetting seed data for the new schema, which is a Phase 5 task.

get_due_problems() exists as a function but isn't exposed to the user yet. It returns all problems where next_review_date <= today, and accepts an anchor_date argument so it's testable with arbitrary dates. Surfacing it in the session flow — and deciding whether a session should start with due reviews or new problems — is also Phase 5.
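For reference, a function with that behavior might be sketched as follows. The anchor_date argument comes from the description above; the conn parameter, the SQLite backend, and the ISO-8601 text dates are assumptions.

```python
import sqlite3
from datetime import date
from typing import Optional


def get_due_problems(
    conn: sqlite3.Connection, anchor_date: Optional[date] = None
) -> list[tuple]:
    """Return (id, next_review_date) for problems due on or before the anchor."""
    anchor = (anchor_date or date.today()).isoformat()
    # ISO-8601 date strings sort lexicographically, so <= works on TEXT.
    # Never-reviewed problems (NULL next_review_date) are skipped.
    return conn.execute(
        "SELECT id, next_review_date FROM problems "
        "WHERE next_review_date IS NOT NULL AND next_review_date <= ? "
        "ORDER BY next_review_date",
        (anchor,),
    ).fetchall()


# Throwaway in-memory table with only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE problems (id INTEGER PRIMARY KEY, next_review_date TEXT)")
conn.executemany(
    "INSERT INTO problems (next_review_date) VALUES (?)",
    [("2024-01-03",), ("2024-01-10",), (None,)],
)

print(get_due_problems(conn, anchor_date=date(2024, 1, 5)))
# [(1, '2024-01-03')]
```

Passing anchor_date explicitly keeps tests deterministic: the same call returns the same rows regardless of the day it runs.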

The deferred scope is intentional. Phase 4's job was getting the scheduling logic right. Wiring it into the user flow is the next phase's job.
