The Four Stages of AI Training: From Pretraining to RLVR

Understanding why agents work now requires understanding the four-stage training evolution that brought us here.

Stage 1: Pretraining (~2020)

Raw pattern learning from massive text corpora. Broad knowledge but no instruction following. Expensive, data-hungry, produces generalist capabilities. Think: encyclopedic knowledge, no understanding of what you’re asking.

Stage 2: Supervised Fine-Tuning (~2022)

Learning by imitation — showing correct input-output pairs. The model learns to answer questions and follow directions. Limited to what humans can demonstrate. If the optimal reasoning trace isn’t obvious, SFT can’t teach it.
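The "learning by imitation" idea can be sketched as a cross-entropy objective over demonstration pairs. This is an illustrative toy, not the article's method: `sft_loss` and the dictionary of model probabilities are stand-ins for a real token-level distribution.

```python
import math

def sft_loss(model_probs, demonstrations):
    """Toy supervised fine-tuning objective: average cross-entropy,
    pushing the model toward reproducing each human-written target."""
    total = 0.0
    for prompt, target in demonstrations:
        # model_probs maps (prompt, target) -> probability the model
        # assigns to the demonstrated answer (stand-in for a real
        # token-level distribution).
        p = model_probs[(prompt, target)]
        total += -math.log(p)  # low probability on the target = high loss
    return total / len(demonstrations)

demos = [("2+2=", "4"), ("capital of France?", "Paris")]
probs = {("2+2=", "4"): 0.9, ("capital of France?", "Paris"): 0.5}
print(round(sft_loss(probs, demos), 3))  # → 0.399
```

The limitation follows directly from the objective: the loss only rewards matching a demonstration, so reasoning strategies no human wrote down can never score well.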

Stage 3: RLHF (~2022)

Reinforcement Learning from Human Feedback — humans compare outputs and say “this one is better.” Works for tone, safety, helpfulness. Bottlenecked by expensive preference labels. Gameable: models learn to produce outputs that look impressive rather than outputs that are actually correct. The reward signal is noisy because human preferences are inconsistent.
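The “this one is better” signal is typically turned into a training objective with a Bradley-Terry-style loss: the reward model is penalized when it scores the human-preferred output lower than the rejected one. A minimal sketch (the function name and reward values are illustrative):

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style reward-model loss: a human preferred output A
    over output B, so penalize the model when it scores B near or above A."""
    # P(A preferred) = sigmoid(r_A - r_B)
    p_preferred = 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))
    return -math.log(p_preferred)

# Reward model agrees with the human label -> small loss
print(round(preference_loss(2.0, 0.5), 3))  # → 0.201
# Reward model disagrees with the human label -> large loss
print(round(preference_loss(0.5, 2.0), 3))  # → 1.701
```

Note that the label itself is just a human comparison, which is exactly why the signal is noisy and gameable: anything that sways the comparison sways the reward.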

Stage 4: RLVR (~2025) — The Breakthrough

Reinforcement Learning from Verifiable Rewards. Instead of “which answer looks better?”, programs verify: “is this answer objectively correct?”
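“Programs verify” can be made concrete with a sketch of a verifiable reward for code generation: run the candidate against tests and pay out only on a full pass. This is an illustrative toy (the `solve` convention and sandboxing shortcut are assumptions; real RLVR pipelines execute candidates in isolation):

```python
def verifiable_reward(candidate_code, test_cases):
    """Binary, non-gameable reward: the candidate either passes every
    test or it earns nothing. No human judgment in the loop."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate's function
        solve = namespace["solve"]       # assumed entry-point name
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0
    except Exception:
        return 0.0  # crashes and malformed code also earn nothing
    return 1.0

tests = [((2, 3), 5), ((0, 0), 0)]
print(verifiable_reward("def solve(a, b): return a + b", tests))  # → 1.0
print(verifiable_reward("def solve(a, b): return a * b", tests))  # → 0.0
```

Because the reward is computed by a program, not a rater, an output that merely *looks* right scores exactly the same as a wrong one: zero.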

Three breakthrough properties:

  1. Non-gameable rewards: Code passes its tests or it doesn’t; there is no judge to impress, which allows much longer optimization runs.
  2. Emergent reasoning: Models discover their own problem-solving strategies rather than imitating humans.
  3. High capability per dollar: Compute redirected into RL runs against verifiable rewards becomes the primary capability driver.

RLVR is the shift from “reward the vibe” to “reward the verifiable truth” — and it’s what makes the agentic revolution possible.


