The Four Stages of AI Training: From Pretraining to RLVR
Understanding why agents work now requires understanding the four-stage training evolution that brought us here.
FourWeekMBA x Business Engineer | Updated 2026
Stage 1: Pretraining (~2020)
Raw pattern learning from massive text corpora. Broad knowledge but no instruction following. Expensive, data-hungry, produces generalist capabilities. Think: encyclopedic knowledge, no understanding of what you’re asking.
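As a toy illustration of "raw pattern learning," the bigram counter below (all names hypothetical) predicts the next word purely from co-occurrence counts in its corpus. Real pretraining trains a neural network with a cross-entropy loss over vastly larger corpora, but the objective has the same shape: predict what comes next.

```python
from collections import Counter, defaultdict

def pretrain(corpus: str) -> dict:
    """Count next-token frequencies: the crudest form of pattern learning."""
    tokens = corpus.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts: dict, token: str) -> str:
    """Emit the most frequent continuation seen during pretraining."""
    return counts[token].most_common(1)[0][0]

model = pretrain("the cat sat on the mat the cat ate the fish")
print(predict(model, "the"))  # "cat" (seen twice after "the"; "mat" and "fish" once)
```

Note what this model lacks: it has absorbed the statistics of its corpus, but nothing in it knows how to follow an instruction.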
Stage 2: Supervised Fine-Tuning (~2022)
Learning by imitation — showing correct input-output pairs. The model learns to answer questions and follow directions. Limited to what humans can demonstrate. If the optimal reasoning trace isn’t obvious, SFT can’t teach it.
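A minimal sketch of the SFT objective, assuming a hypothetical `model_prob` that returns P(response | prompt): the model is scored only on how well it reproduces the demonstrated answers, which is exactly why it cannot exceed what humans can demonstrate.

```python
import math

# Hypothetical demonstration data: each pair shows the exact answer to imitate.
sft_data = [
    {"prompt": "Capital of France?", "response": "Paris"},
    {"prompt": "2 + 2 = ?", "response": "4"},
]

def sft_loss(model_prob, pairs):
    """Average negative log-likelihood of the demonstrated responses.
    The loss targets only the response; the prompt is context."""
    nll = 0.0
    for ex in pairs:
        p = model_prob(ex["prompt"], ex["response"])  # P(response | prompt)
        nll += -math.log(p)
    return nll / len(pairs)

# A stand-in model that assigns probability 0.5 to every response.
print(sft_loss(lambda prompt, resp: 0.5, sft_data))  # ln 2 ≈ 0.693
```

Minimizing this loss can only pull the model toward the demonstrations; there is no signal for a better answer no human wrote down.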
Stage 3: RLHF (~2022)
Reinforcement Learning from Human Feedback — humans compare outputs and say “this one is better.” Works for tone, safety, helpfulness. Bottlenecked by expensive preference labels. Gameable: models learn to produce outputs that look impressive rather than outputs that are correct. The reward signal is noisy because human preferences are inconsistent.
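Those pairwise comparisons are typically used to train a reward model with the Bradley–Terry pairwise loss; a minimal sketch (function names are illustrative):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Pushes the reward model to score the human-preferred output higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the reward model already prefers the chosen output, the loss is small...
print(round(preference_loss(2.0, -1.0), 3))  # 0.049
# ...and when it prefers the rejected output, the loss is large.
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049
```

Note that the loss depends only on which output a human said was better, never on whether either output was actually correct — which is where both the gaming and the noise come from.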
Stage 4: RLVR (~2025) — The Breakthrough
Reinforcement Learning from Verifiable Rewards. Instead of “which answer looks better?”, programs verify: “is this answer objectively correct?”
Three breakthrough properties:
Non-gameable rewards: Code passes tests or doesn’t. Allows much longer optimization runs.
Emergent reasoning: Models discover their own problem-solving strategies rather than imitating humans.
High capability per dollar: Compute redirected into RL runs against verifiable rewards becomes the primary capability driver.
RLVR is the shift from “reward the vibe” to “reward the verifiable truth” — and it’s what makes the agentic revolution possible.
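A minimal sketch of a verifiable reward for code generation (names illustrative; a production system would sandbox execution rather than call `exec` directly): the reward is 1.0 only if the candidate passes every test, so there is no "looks impressive" channel left to game.

```python
def verifiable_reward(candidate_src: str, tests: list) -> float:
    """Binary reward: 1.0 only if the candidate passes every test.
    Nothing to game -- the code either passes or it doesn't."""
    scope = {}
    try:
        exec(candidate_src, scope)  # unsafe outside a sandbox; sketch only
        return 1.0 if all(t(scope) for t in tests) else 0.0
    except Exception:
        return 0.0

tests = [lambda s: s["add"](2, 3) == 5, lambda s: s["add"](-1, 1) == 0]
print(verifiable_reward("def add(a, b): return a + b", tests))  # 1.0
print(verifiable_reward("def add(a, b): return a - b", tests))  # 0.0
```

Because this signal cannot be flattered, RL can optimize against it for far longer without the policy drifting into reward hacking, which is the property the three points above all flow from.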
Gennaro is the creator of FourWeekMBA, which reached about four million business people, including C-level executives, investors, analysts, product managers, and aspiring digital entrepreneurs, in 2022 alone. He is also Director of Sales for a high-tech scaleup in the AI industry. In 2012, Gennaro earned an International MBA with an emphasis on Corporate Finance and Business Strategy.