Adding Spaced Repetition to an LLM Study System — SM-2, Schema Design, and a Scoring Problem Worth Solving
The first three phases of my study system handled the session loop: generate a problem, grade the solution, log the result. Useful, but with a fundamental gap — there was no memory of what I struggled with, and no mechanism to resurface it at the right time.
Phase 4 adds spaced repetition via the SM-2 algorithm. The implementation forced three interesting design decisions: a schema change that clarified a data modeling confusion, a scoring conversion that rejected the obvious naive approach, and a function boundary that made the scheduling logic actually testable.
What SM-2 Does
SM-2 is the SuperMemo algorithm that Anki's original scheduler is based on. Instead of reviewing problems on a fixed schedule, each problem is scheduled based on how well you did last time. Struggle → see it again soon. Do well → longer gap before it resurfaces.
Each problem tracks three persistent values:
interval: days until the next review
ease_factor: a multiplier that controls how fast the interval grows (starts at 2.5, floors at 1.3)
repetitions: how many consecutive successful reviews have happened
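After each session, a 0–5 quality score drives the update. Scores below 3 reset the repetition count and drop the interval back to 1 day. Scores of 3 and above count as a success: the first two successes schedule the problem 1 and then 6 days out, and every later success multiplies the previous interval by the ease factor.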
The result is a scheduling system that adapts to actual performance — problems you consistently get right disappear from the queue for weeks; problems you keep missing come back the next day.
The Schema Problem: Two Tables, Two Concerns
Before Phase 4, the system had one table: sessions. Each row was a logged study session with topic, difficulty, and grading results. Adding SM-2 to that table would mean mixing two fundamentally different concerns.
Sessions are historical records — immutable, append-only, one row per session. SM-2 state is current state — mutable, one row per problem, updated in place after every review. Mixing them would produce a table where some columns are historical facts and others are running totals, with no clean way to query either correctly.
The solution is a problems table that holds one row per generated problem. SM-2 fields (interval, ease_factor, repetitions, next_review_date) start as NULL on first insert and get updated in place after each session. sessions gets a problem_id foreign key and drops the fields now derivable via join.
An earlier design would have stored a new SM-2 row after each session to preserve a history of scheduling states. The problem: there's no practical query that needs past SM-2 states. The only thing the scheduler ever needs is the current state: the next review date and the current ease factor. Keeping historical rows would add storage cost and query complexity for zero benefit.
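The two-table split described above can be sketched with an in-memory SQLite database. This is only an illustration: apart from problem_id and the four SM-2 fields, the column names (topic, difficulty, total_score, max_score, logged_at) and the sample values are assumptions, not the actual schema.

```python
import sqlite3

# Hypothetical schema: only problem_id and the SM-2 fields come from the text above
SCHEMA = """
CREATE TABLE problems (
    id               INTEGER PRIMARY KEY,
    topic            TEXT NOT NULL,
    difficulty       TEXT NOT NULL,
    -- SM-2 state: NULL until the first review, then updated in place
    interval         INTEGER,
    ease_factor      REAL,
    repetitions      INTEGER,
    next_review_date TEXT
);
CREATE TABLE sessions (
    id          INTEGER PRIMARY KEY,
    problem_id  INTEGER NOT NULL REFERENCES problems(id),
    total_score INTEGER NOT NULL,
    max_score   INTEGER NOT NULL,
    logged_at   TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# A new problem starts with empty SM-2 state
conn.execute(
    "INSERT INTO problems (id, topic, difficulty) VALUES (1, 'recursion', 'medium')"
)

# Each review appends one session row and overwrites the problem's current state
for logged_at, interval, next_review in (
    ("2024-01-05", 1, "2024-01-06"),
    ("2024-01-06", 6, "2024-01-12"),
):
    conn.execute(
        "INSERT INTO sessions (problem_id, total_score, max_score, logged_at) "
        "VALUES (1, 8, 10, ?)",
        (logged_at,),
    )
    conn.execute(
        "UPDATE problems SET interval = ?, next_review_date = ? WHERE id = 1",
        (interval, next_review),
    )

session_count = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
problem_count = conn.execute("SELECT COUNT(*) FROM problems").fetchone()[0]
current_interval = conn.execute(
    "SELECT interval FROM problems WHERE id = 1"
).fetchone()[0]
```

After two reviews there are two session rows but still one problem row, and that row holds only the latest interval.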
The Scoring Conversion Problem
SM-2 expects a 0–5 quality rating with specific semantics: 0–2 means the review failed, 3–5 means it passed. That threshold is what drives whether the interval resets or grows — so the mapping from session score to SM-2 score has to respect it.
The naive approach is a linear mapping: round(total_score / max_score * 5). The problem is that rubrics with multiple criteria rarely produce scores near 0 even for completely wrong answers. A solution with correct structure but wrong output might still score 40–50% on the rubric because it passes naming, modularity, and style criteria. A linear mapping puts those answers at 2, one point below the pass threshold, and a wrong answer that reaches 60% maps to 3 and counts as a successful review. At that point the threshold no longer separates right answers from wrong ones.
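A quick way to see the problem is to run the linear mapping on a few rubric totals. The function name and the three sample totals below are made up for illustration; each one stands for an answer whose output was wrong.

```python
def naive_sm2_score(total_score: int, max_score: int) -> int:
    """Linear mapping from a rubric total to the 0-5 SM-2 scale."""
    return round(total_score / max_score * 5)


# Hypothetical rubric totals for three answers that all produced wrong output
wrong_answers = {"40%": (4, 10), "50%": (5, 10), "60%": (6, 10)}

naive = {
    label: naive_sm2_score(total, maximum)
    for label, (total, maximum) in wrong_answers.items()
}
print(naive)  # {'40%': 2, '50%': 2, '60%': 3}
```

The 60% answer gets a 3, so SM-2 would treat a wrong solution as a successful review and push it further out.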
The correct approach uses correct_output as the gate:
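import math

def calculate_initial_sm2_score(
    is_correct: bool, is_uncertain: bool, total_score: int, max_score: int
) -> int:
    # Secondary criteria contribute 0-2 points
    sm2_score = math.floor(total_score / max_score * 2)
    # Correct output lifts the score into the passing 3-5 band
    if is_correct:
        sm2_score += 3
    # An unreliable grade costs one point, but never drops below 0
    if is_uncertain and sm2_score > 0:
        sm2_score -= 1
    return sm2_score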
If correct_output failed, the score lands in 0–2 regardless of how the other criteria performed. If it passed, it starts at 3 and scales up based on the remaining criteria. The is_uncertain flag (set when the two grading passes diverge by more than 5%) decrements the score by 1 — a signal that the grade itself isn't reliable, which should slow the scheduling down rather than reward it.
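A handful of inputs make the gate concrete. The function is repeated here so the snippet runs on its own; the sample rubric totals are invented.

```python
import math


def calculate_initial_sm2_score(
    is_correct: bool, is_uncertain: bool, total_score: int, max_score: int
) -> int:
    sm2_score = math.floor(total_score / max_score * 2)
    if is_correct:
        sm2_score += 3
    if is_uncertain and sm2_score > 0:
        sm2_score -= 1
    return sm2_score


# (is_correct, is_uncertain, total_score, max_score)
cases = [
    (False, False, 5, 10),   # wrong output, half the secondary points
    (False, False, 10, 10),  # wrong output, perfect secondary score
    (True, False, 0, 10),    # right output, nothing else
    (True, False, 10, 10),   # right output, perfect secondary score
    (True, True, 0, 10),     # right output, but the grade is uncertain
    (False, True, 0, 10),    # already 0, so the uncertainty penalty is skipped
]

scores = [calculate_initial_sm2_score(*case) for case in cases]
print(scores)  # [1, 2, 3, 5, 2, 0]
```

One consequence worth noticing: a borderline pass of 3 that is flagged uncertain becomes a 2, which SM-2 treats as a failed review, so the problem comes back the next day.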
Criteria with evaluation_dependency: "correct_output" — things like edge case handling — are excluded from the secondary score entirely. Including them when correctness failed would double-penalize and distort the 0–2 range.
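One way the exclusion could look in code, as a rough sketch. The rubric layout (a list of dicts with name, score, and max_score keys) and the helper name secondary_totals are assumptions; only evaluation_dependency and correct_output come from the system described here. The sketch also assumes correct_output itself stays out of the secondary total, since it is already counted through the gate.

```python
def secondary_totals(criteria: list) -> tuple:
    """Sum score and max_score over the criteria that feed the 0-2 component."""
    independent = [
        c for c in criteria
        # Assumption: the gate criterion is not counted a second time
        if c["name"] != "correct_output"
        # Criteria that can only be judged when the output is right are skipped
        and c.get("evaluation_dependency") != "correct_output"
    ]
    total = sum(c["score"] for c in independent)
    maximum = sum(c["max_score"] for c in independent)
    return total, maximum


# Hypothetical rubric result for a solution with wrong output
rubric = [
    {"name": "correct_output", "score": 0, "max_score": 5},
    {"name": "edge_cases", "score": 0, "max_score": 3,
     "evaluation_dependency": "correct_output"},
    {"name": "naming", "score": 2, "max_score": 2},
    {"name": "modularity", "score": 1, "max_score": 2},
]

total_score, max_score = secondary_totals(rubric)
print(total_score, max_score)  # 3 4
```

With edge_cases left in, the same answer would total 3 out of 7 instead of 3 out of 4, dragging the secondary score down for a failure that correct_output has already accounted for.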
SM-2 as a Pure Function
The actual SM-2 calculation is implemented as a pure function: takes current state + the 0–5 score, returns updated state, touches no database.
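from typing import NamedTuple

# Result shape assumed here: the fields mirror the keyword arguments used below
class SM2Result(NamedTuple):
    repetitions: int
    interval: int
    ease_factor: float

def calculate_sm2(
    repetitions: int,
    interval: int,
    ease_factor: float,
    quality: int
) -> SM2Result:
    # Failed review: restart the repetition sequence, leave the ease factor alone
    if quality < 3:
        return SM2Result(repetitions=0, interval=1, ease_factor=ease_factor)
    # Published SM-2 ease update: +0.10 at quality 5, 0.00 at 4, -0.14 at 3, floored at 1.3
    miss = 5 - quality
    new_ease = max(1.3, ease_factor + 0.1 - miss * (0.08 + miss * 0.02))
    # First two successes use fixed intervals; later ones scale by the ease factor
    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        new_interval = round(interval * ease_factor)
    return SM2Result(
        repetitions=repetitions + 1,
        interval=new_interval,
        ease_factor=new_ease
    )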
The caller (log_session()) handles DB reads and writes. First-review defaults are resolved before the function is called:
ease_factor = problem["ease_factor"] if problem["ease_factor"] is not None else 2.5
repetitions = problem["repetitions"] if problem["repetitions"] is not None else 0
interval = problem["interval"] if problem["interval"] is not None else 0
Keeping defaults out of the pure function means the function is fully testable without a database — pass in any combination of state and score, verify the output.
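As a sketch of what that testability buys, the snippet below walks one problem through three perfect reviews and then a failure, with no database involved. It restates the update compactly so it runs on its own, using the published SM-2 ease formula; the SM2Result shape is an assumption.

```python
from typing import NamedTuple


class SM2Result(NamedTuple):  # assumed shape of the result object
    repetitions: int
    interval: int
    ease_factor: float


def calculate_sm2(
    repetitions: int, interval: int, ease_factor: float, quality: int
) -> SM2Result:
    if quality < 3:
        return SM2Result(0, 1, ease_factor)
    miss = 5 - quality
    new_ease = max(1.3, ease_factor + 0.1 - miss * (0.08 + miss * 0.02))
    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        new_interval = round(interval * ease_factor)
    return SM2Result(repetitions + 1, new_interval, new_ease)


# First-review defaults, then three perfect reviews followed by a failure
state = SM2Result(repetitions=0, interval=0, ease_factor=2.5)
history = []
for quality in (5, 5, 5, 2):
    state = calculate_sm2(
        state.repetitions, state.interval, state.ease_factor, quality
    )
    history.append((state.repetitions, state.interval))

print(history)  # [(1, 1), (2, 6), (3, 16), (0, 1)]
```

The failure at the end resets repetitions and interval but keeps the ease factor the problem had earned, which is roughly 2.8 after three perfect reviews.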
One Date Decision That Matters
An earlier version calculated next_review_date as the old due date plus the new interval. If a problem was due on Monday and reviewed on Friday, the next review would be scheduled from Monday — not Friday.
That's wrong. Spaced repetition should adapt to when the review actually happened, not when it was supposed to happen. Late reviews shouldn't cascade into further compressed scheduling. next_review_date is always calculated from today.
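The Monday/Friday case can be checked with a few lines. The helper name next_review_date_for and its today parameter are illustrative, not the system's actual code; the dates are arbitrary apart from falling on a Monday and a Friday.

```python
from datetime import date, timedelta
from typing import Optional


def next_review_date_for(interval_days: int, today: Optional[date] = None) -> date:
    """Schedule the next review relative to the day the review actually happened."""
    anchor = today if today is not None else date.today()
    return anchor + timedelta(days=interval_days)


old_due_date = date(2024, 1, 1)  # Monday: when the review was due
reviewed_on = date(2024, 1, 5)   # Friday: when it actually happened
interval = 6

# Earlier approach: anchor on the old due date
from_due_date = old_due_date + timedelta(days=interval)
# Current approach: anchor on the review day
from_review_day = next_review_date_for(interval, today=reviewed_on)

print(from_due_date, from_review_day)  # 2024-01-07 2024-01-11
```

Anchoring on the due date would bring the problem back two days after the review instead of six.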
What's Still Not Wired Up
Phase 4 adds the schema, the scoring conversion, and the SM-2 calculation. Two things are explicitly deferred:
main.py hasn't been updated yet — it still uses the old log_session() signature. End-to-end wiring is blocked on resetting seed data for the new schema, which is a Phase 5 task.
get_due_problems() exists as a function but isn't exposed to the user yet. It returns all problems where next_review_date <= today, and accepts an anchor_date argument so it's testable with arbitrary dates. Surfacing it in the session flow — and deciding whether a session should start with due reviews or new problems — is also Phase 5.
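A minimal sketch of how such a query might look against SQLite, assuming dates are stored as ISO-8601 text. The conn parameter, the cut-down table, and the return shape (a list of ids) are assumptions for illustration; only the anchor_date argument and the next_review_date <= today condition come from the description above.

```python
import sqlite3
from datetime import date
from typing import Optional


def get_due_problems(
    conn: sqlite3.Connection, anchor_date: Optional[date] = None
) -> list:
    """Return ids of problems whose next_review_date is on or before the anchor."""
    anchor = (anchor_date if anchor_date is not None else date.today()).isoformat()
    rows = conn.execute(
        "SELECT id FROM problems "
        "WHERE next_review_date <= ? "
        "ORDER BY next_review_date",
        (anchor,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE problems (id INTEGER PRIMARY KEY, next_review_date TEXT)")
conn.executemany(
    "INSERT INTO problems VALUES (?, ?)",
    [(1, "2024-01-05"), (2, "2024-01-10"), (3, None)],  # 3 was never reviewed
)

due_on_the_7th = get_due_problems(conn, anchor_date=date(2024, 1, 7))
due_on_the_10th = get_due_problems(conn, anchor_date=date(2024, 1, 10))
print(due_on_the_7th, due_on_the_10th)  # [1] [1, 2]
```

In this sketch a problem with a NULL next_review_date never matches the comparison, so never-reviewed problems stay out of the due list and would have to be surfaced separately as new problems.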
The deferred scope is intentional. Phase 4's job was getting the scheduling logic right. Wiring it into the user flow is the next phase's job.

