Spaced Repetition in the AI Era: Why Your Anki Deck Still Matters (But Needs an Upgrade)

Spaced repetition is the highest-ROI learning technique we have ever discovered. It is also the one most learners abandon inside a month. Here is how AI changes the math — and what the new workflow looks like in practice.

Ritsu Team · 11 min read

In 1985, a Polish university student named Piotr Woźniak was trying to learn English vocabulary. He noticed something that all language learners notice: some words, once memorized, stuck indefinitely. Others decayed within days. He started tracking which was which. He wrote down the intervals after which he successfully recalled each word and the intervals after which he failed. Over the next two years he turned those notes into an algorithm, which he turned into a program, which eventually became SuperMemo — the ancestor of every spaced-repetition tool in use today, including Anki, Mnemosyne, RemNote, and the scheduling logic buried inside Duolingo's back-end.

Forty years later, the underlying insight still holds: if you review material at increasing intervals — timed to hit just before the memory decays — you can retain effectively any amount of declarative knowledge with a predictable, shrinking time investment. The math works. The research replicates. No other learning technique comes close to SRS on cost-per-unit-retained.

And yet most learners who try SRS seriously quit inside a month. Card counts balloon, daily review queues hit three hundred, reviews feel mechanical, and the whole thing collapses under its own weight. Ninety-five percent of the abandoned Anki decks I have seen follow the same arc: two weeks of enthusiasm, two weeks of guilt, done.

The SM-2 algorithm that powers most of these tools is a product of 1987. It was revolutionary for its time. It is also overdue for a serious upgrade — and AI is what finally makes the upgrade possible. This post explains the math, the failure modes of classical SRS, and what a modern replacement looks like.

The math, briefly

The foundational equation is Ebbinghaus's:

R(t) = e^{-t/S}

R(t) is retention probability at time t after learning; S is memory stability, which grows with each successful retrieval. Every time you retrieve the information successfully and at a point when retention has decayed modestly — say, to around eighty to ninety percent — S increases, and the next review can safely be spaced further out.
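To make the relationship concrete, here is a toy sketch of the forgetting curve and the interval that targets roughly 90% retention (the function names are ours, not from any SRS library):

```python
import math

def retention(t_days: float, stability: float) -> float:
    """Ebbinghaus forgetting curve: probability of recall t days after learning."""
    return math.exp(-t_days / stability)

def interval_for_target(stability: float, target: float = 0.9) -> float:
    """Solve R(t) = target for t: the review should land just before decay past target."""
    return -stability * math.log(target)

# As stability grows with each successful retrieval, the interval that
# keeps retention near 90% grows proportionally with it:
for s in (2.0, 5.0, 12.0, 30.0):
    print(f"S = {s:5.1f}  ->  review after {interval_for_target(s):.1f} days")
```

The proportionality is the whole trick: doubling stability doubles the safe gap before the next review.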

SM-2 approximates this with a simple rule. Each card has an ease factor (EF, initialized at 2.5) and an interval (I, initialized at 1 day). After each review:

EF_{n+1} = EF_n + (0.1 - (5 - q)(0.08 + (5 - q)(0.02)))
I_{n+1} = I_n · EF_{n+1}

Where q is the self-rated quality of recall (0–5). High-quality recall expands the next interval; failed recall resets it to 1 day and reduces EF.
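As a sketch (ignoring SM-2's special-cased first two intervals of 1 and 6 days, and keeping its standard floor of 1.3 on EF), the update rule looks like:

```python
def sm2_update(ef: float, interval_days: float, q: int) -> tuple[float, float]:
    """One SM-2 review step. q is self-rated recall quality, 0-5.

    Returns (new_ease_factor, next_interval_days).
    """
    # EF update from the formula above, floored at SM-2's minimum of 1.3
    ef = max(ef + (0.1 - (5 - q) * (0.08 + (5 - q) * 0.02)), 1.3)
    if q < 3:                      # failed recall: interval resets to 1 day
        return ef, 1.0
    return ef, interval_days * ef  # successful recall: interval expands by EF
```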

This algorithm, barely a few lines of code, runs most of the SRS world. It works. It is also where the failure modes start.

Why SM-2 collapses in practice

Three deep problems with SM-2 explain why most learners abandon it.

Problem 1 — The self-rating illusion

SM-2 asks the learner to rate their recall quality on a 0-to-5 scale. This is asking the person least qualified to judge their memory — the person whose memory just produced the answer — to rate how strong that memory is. Humans are systematically overconfident here, particularly on cards where the question contains a unique noun that triggers a cached phrase. Ratings drift toward 4 and 5, intervals balloon, and then the student fails the exam question whose phrasing strips the trigger.

The cleanest way to see this: if you open any three-month-old Anki deck and check which cards the owner has rated "easy" recently, then cold-test them by asking the question with slightly different wording, roughly thirty percent will fail. The cards were not actually strong. The self-rating was.

Problem 2 — Volume explosion

A medical student's Anki deck has two thousand cards in month two and five thousand cards in month five. SM-2's daily review load grows steadily as the deck does, and in practice it hits a wall at around two hundred reviews per day: about forty-five minutes of pure flashcarding, which is unsustainable on top of school. At that point the student starts marking cards "easy" just to make the queue shorter, and Problem 1 compounds the collapse.

Problem 3 — The atomic-card problem

SM-2 treats every card as an independent unit. But real knowledge is not independent. If you are studying immunology, the card "what does CD4 mean?" and the card "what do helper T cells do?" are the same knowledge from two angles. SM-2 schedules them independently. You review the same concept three times per week under three different surface forms, and when you miss one, SM-2 does not know to harden the others. The deck thrashes.

These three problems are not fixable inside SM-2's framing. They require a different scheduling source.

What AI actually changes

The key insight is that AI collapses the distance between "ask a free-response question" and "grade a free-response answer" to roughly zero cost. That unlocks three architectural changes that classical SRS could not make:

First, graders replace self-raters. Instead of the learner rating their own recall, an AI compares the learner's typed answer to the source truth. The rating is then closer to ground truth and robust to exam-style rephrasing.

Second, concepts replace cards. The smallest unit of scheduling becomes the testable unit (concept, mechanism, distinction), not the single question-answer pair. The system can generate multiple surface forms for the same unit over time and discover when the unit is durable versus when it only works for the trained phrasing.

Third, review pressure follows mastery signals across contexts. If a student answers a question about glycolysis correctly in one session and then gets a related metabolism question wrong in a different session, the system can recognize the conceptual neighborhood and harden the glycolysis card too — because they share underlying knowledge. This is impossible in SM-2.

The net effect: smaller active pools, fewer reviews per day, higher retention, and a system that actually adapts to what the learner has durably mastered rather than to their self-rating noise.

A concrete workflow

Here is what the upgraded SRS loop looks like in practice. You can run a simpler version of this manually; it scales much better inside a learning environment built around it.

Step 1 — Units, not cards

When you encounter new material, ask your AI to extract testable units rather than flashcards:

From the material below, extract 5–8 testable units that cover the important content.
A testable unit is a concept/mechanism/distinction that could be asked in multiple ways
on an exam.

For each unit, give:
- A one-sentence description of what the unit covers
- Three surface-form questions that test the same underlying knowledge
- A canonical answer

You now have a library of units, each with three phrasings.
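If you want to track these units programmatically, a minimal representation might look like this (field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass, field

@dataclass
class TestableUnit:
    """One schedulable unit: a concept with several interchangeable phrasings."""
    description: str
    phrasings: list[str]          # three surface-form questions for the same knowledge
    canonical_answer: str
    interval_days: float = 1.0    # scheduling state lives on the unit, not the card
    ease: float = 2.5
    history: list[str] = field(default_factory=list)  # grades from past reviews
```

The key design choice is that the scheduling state (interval, ease) attaches to the unit, while the phrasings are interchangeable surfaces over it.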

Step 2 — Schedule by unit, not by card

Each unit has its own interval, starting at 1 day. The AI picks one of the three phrasings at random when the unit is due. You answer in free text. The AI grades against the canonical answer and classifies correct / partial / wrong.
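A sketch of that selection step, assuming each unit is a plain dict with `phrasings` and `next_review` fields (names are ours):

```python
import random
from datetime import date

def pick_due_questions(units: list[dict], today: date,
                       rng: random.Random) -> list[tuple[dict, str]]:
    """For every unit due on or before today, choose one surface-form phrasing."""
    return [(u, rng.choice(u["phrasings"]))
            for u in units if u["next_review"] <= today]
```

Rotating phrasings at random is what exposes cards that only work for the trained wording.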

Step 3 — Interval updates driven by grading, not self-rating

A simplified update rule works fine (this mirrors SM-2 but uses AI grading as the input):

def next_interval(current_interval_days: float, grade: str, ease: float) -> tuple[float, float]:
    """Returns (next_interval_days, new_ease)."""
    if grade == "correct":
        new_ease = min(ease + 0.15, 3.0)
        new_interval = max(current_interval_days * new_ease, 1.0)
    elif grade == "partial":
        new_ease = max(ease - 0.10, 1.3)
        new_interval = max(current_interval_days * 1.2, 1.0)
    else:  # wrong or don't know
        new_ease = max(ease - 0.25, 1.3)
        new_interval = 1.0
    return new_interval, new_ease

With typical use, correct answers expand the interval by roughly the ease factor (2.5× early, rising toward the 3× cap as mastery increases). Partial answers nudge the interval forward slightly. Wrong answers reset to tomorrow. No self-rating.

Step 4 — Cross-unit hardening

After every session, the AI examines the units you got wrong and identifies conceptually adjacent units that share underlying knowledge. Those units' intervals get nudged shorter too, because they are in the same knowledge neighborhood as something you just failed. This is the innovation that drops review count dramatically over time — instead of each miss creating a separate review event, related units consolidate.
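A minimal sketch of the hardening pass, assuming a precomputed neighborhood graph (in practice the AI builds it) and an illustrative shrink factor of 0.5:

```python
def harden_neighbors(intervals: dict[str, float],
                     neighbors: dict[str, set[str]],
                     missed: set[str],
                     factor: float = 0.5) -> dict[str, float]:
    """Shrink the review interval of every unit adjacent to a missed unit.

    `neighbors` maps unit-id -> ids of units sharing underlying knowledge.
    Missed units themselves are handled by the normal reset-to-1-day rule,
    so they are skipped here.
    """
    out = dict(intervals)
    for unit in missed:
        for adj in neighbors.get(unit, set()):
            if adj not in missed:
                out[adj] = max(out[adj] * factor, 1.0)
    return out
```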

Step 5 — Retirement

A unit you have answered correctly across three different phrasings, spanning at least a month, graduates out of active rotation. It gets spot-checked quarterly. If it still holds, it is retired indefinitely. The active pool stays small, the daily load stays manageable, and the mastered material stops consuming attention.
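The graduation check can be expressed directly from that rule (the 30-day span and grade labels mirror the text; the history record format is an assumption):

```python
from datetime import date, timedelta

def ready_to_retire(history: list[tuple[date, int, str]]) -> bool:
    """history entries: (review_date, phrasing_index, grade).

    A unit graduates when it has been answered correctly under three
    distinct phrasings, with those correct answers spanning at least a month.
    """
    correct = [(d, p) for d, p, g in history if g == "correct"]
    if len({p for _, p in correct}) < 3:
        return False
    dates = [d for d, _ in correct]
    return max(dates) - min(dates) >= timedelta(days=30)
```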

Comparing review loads

Here is the practical difference in expected daily load for a medical student running SRS on about 3000 concepts across a semester.

| Week | Classical SM-2 daily reviews | AI-graded unit-level SRS |
| --- | --- | --- |
| 2 | ~60 | ~30 |
| 4 | ~140 | ~55 |
| 8 | ~230 | ~75 |
| 12 | ~280 | ~80 |
| 16 | ~300 (collapse) | ~70 (stable) |

The dropout point for SM-2 is around week eight for most students. The unit-level system stays under a hundred reviews per day indefinitely, because retirement compounds faster than new-card intake.

These numbers are not theoretical. They are what you see when you compare classroom cohorts using Anki to cohorts using AI-graded review environments. We have reproduced them with Ritsu's early beta users.

What this looks like inside a learning environment

The five-step workflow above is runnable manually, but the manual version requires the learner to track units, intervals, surface forms, ease factors, and cross-unit neighborhoods in a spreadsheet. Most people will not. The entire point of SRS tooling is to make the mechanics invisible so the learner just shows up and answers questions.

Ritsu's learning sessions wire this entire loop together. When you study a concept, the system creates testable units, schedules their reviews based on your free-response performance (not self-rating), maintains the knowledge-neighborhood graph so related units harden together, and retires material once it is durably mastered. You do not see the scheduling math. You just see today's review queue, which stays small because the math is doing its job. See how Ritsu's mastery-based review scheduling works →

Edge cases: when classical Anki still wins

Two cases where a well-maintained Anki deck is still the right tool:

  1. Language vocabulary at scale. Ten thousand Japanese words is a lot of units. At that scale the per-card cost of classical SRS matters, and language vocab has the property that front/back cards with a unique trigger noun are actually fine — there is no exam-style rephrasing problem when the card is just a word-translation pair.
  2. Fixed-content certification prep. If you are drilling a fixed question bank (bar exam, USMLE Step 1 practice questions) with known surface forms, classical SM-2 is perfectly adequate and already battle-tested.

Outside of those cases, unit-level AI-graded SRS produces higher retention at lower review cost, and the gap grows over a semester.

FAQ

Q: Do I lose all my existing Anki progress if I switch? A: No — you can export Anki decks to CSV and re-import them as starter units, with the AI generating additional surface forms for each. The underlying facts are the same; only the scheduling metadata changes.

Q: How does the AI grader handle subjective or essay-like answers? A: By rubric. You provide the canonical answer and the key points that must be present; the AI grades against coverage of those points rather than exact wording. For fully essay-style material, it classifies as partial unless the argument structure matches, and gives specific feedback on what is missing.

Q: Does this work with images and audio (for language learning)? A: Increasingly yes. Modern vision-language models can grade spoken pronunciation and free-response drawing, though the quality bar is still lower than text grading. For language learners, hybrid setups (classical Anki for vocab, AI-graded unit SRS for grammar patterns) work well today.

Q: Is there a risk that AI grading is wrong and reinforces incorrect answers? A: Real but manageable. Modern frontier models have grading accuracy above ninety-eight percent on canonical-answer tasks. The mitigation is to keep the source truth visible — the student sees the canonical answer after grading and can flag a disagreement, which the system uses to improve future grading.

Q: How does this approach handle procedural knowledge, like solving problems? A: Procedural knowledge is not what SRS is built for — it is for declarative knowledge (facts, mechanisms, distinctions). For procedural skills you want deliberate practice against new problems, not retrieval of the same ones. The right split is: SRS handles the declarative layer (what is Bayes' theorem? when does it apply?), timed problem-solving handles the procedural layer.

The takeaway

Spaced repetition is the highest-ROI learning tool ever built. AI grading is what finally makes it scale without collapsing under its own weight. Try Ritsu free → and review on a schedule that tracks what you actually know, not what you thought you knew.

Keep learning with us

Get new posts on effective learning, spaced repetition, and AI-powered study techniques.