
Understanding why agents work now requires understanding the four-stage training evolution that brought us here.
Stage 1: Pretraining (~2020)
Raw pattern learning from massive text corpora. Broad knowledge but no instruction following. Expensive, data-hungry, produces generalist capabilities. Think: encyclopedic knowledge, no understanding of what you’re asking.
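The "raw pattern learning" objective is next-token prediction. A minimal sketch, with a toy bigram count model standing in for the neural network (all names and the corpus are illustrative):

```python
# Sketch of the pretraining objective: predict the next token, score the
# prediction with negative log-likelihood. A bigram count model stands in
# for the actual neural network here.
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the cat ate".split()

# "Train": count which token follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token_probs(prev):
    """Probability distribution over the next token, given the previous one."""
    counts = following[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Pretraining loss = average negative log-likelihood of the true next token.
nll = -sum(math.log(next_token_probs(p)[n]) for p, n in zip(corpus, corpus[1:]))
print(round(nll / (len(corpus) - 1), 3))
```

Nothing in this objective rewards following instructions, which is exactly why a pretrained-only model has knowledge but no sense of what you're asking.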
Stage 2: Supervised Fine-Tuning (~2022)
Learning by imitation — showing correct input-output pairs. The model learns to answer questions and follow directions. Limited to what humans can demonstrate. If the optimal reasoning trace isn’t obvious, SFT can’t teach it.
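"Learning by imitation" means cross-entropy training on demonstrated (prompt, response) pairs, with the loss computed only on the response tokens. A minimal sketch of the label masking (token ids and the `IGNORE` convention are illustrative, in the style of common fine-tuning libraries):

```python
# Sketch of SFT data preparation: the model imitates the demonstrated
# response, and the loss is masked out over the prompt tokens.
IGNORE = -100  # conventional "compute no loss here" label id

def build_labels(prompt_ids, response_ids):
    """Labels for one SFT example: mask the prompt, supervise the response."""
    return [IGNORE] * len(prompt_ids) + list(response_ids)

prompt = [101, 102, 103]   # token ids for "What is 2+2?" (illustrative)
response = [201, 202]      # token ids for "4." (illustrative)
print(build_labels(prompt, response))  # [-100, -100, -100, 201, 202]
```

The limitation follows directly: the model can only be pushed toward response tokens a human actually wrote down, so reasoning strategies no one demonstrated are out of reach.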
Stage 3: RLHF (~2022)
Reinforcement Learning from Human Feedback — humans compare outputs and say “this one is better.” Works for tone, safety, and helpfulness. Bottlenecked by expensive preference labels. Gameable: models learn to produce outputs that look impressive rather than outputs that are correct. The reward signal is noisy because human preferences are inconsistent.
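Those pairwise "this one is better" judgments are typically turned into a reward model with a Bradley-Terry pairwise loss. A minimal sketch (the function name is illustrative):

```python
# Sketch of the RLHF reward-model objective: given a human preference
# "chosen beats rejected", push the chosen output's reward above the
# rejected one's via the Bradley-Terry pairwise loss.
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): low when chosen scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Low loss when the reward model already ranks the pair correctly,
# high loss when it ranks them the wrong way around.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Note what the loss never sees: whether the chosen answer was actually correct — only that a human preferred how it looked. That gap is what makes RLHF gameable.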
Stage 4: RLVR (~2025) — The Breakthrough
Reinforcement Learning from Verifiable Rewards. Instead of “which answer looks better?”, programs verify: “is this answer objectively correct?”
Three breakthrough properties:
- Non-gameable rewards: Code either passes its tests or it doesn’t. Because the signal can’t be fooled, optimization runs can be far longer.
- Emergent reasoning: Models discover their own problem-solving strategies rather than imitating humans.
- High capability per dollar: Compute redirected into RL runs against verifiable rewards becomes the primary capability driver.
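The "code passes tests or doesn't" reward can be sketched concretely. A minimal illustration, assuming a code-generation task scored against unit tests (function names and samples are hypothetical):

```python
# Sketch of a verifiable reward in the RLVR sense: a program, not a human,
# decides whether the model's answer is objectively correct.

def verifiable_reward(candidate_source, tests):
    """Return 1.0 iff the generated code passes every test, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)        # run the generated code
        for test in tests:
            assert test(namespace), "test failed"
    except Exception:
        return 0.0                               # any failure: zero reward
    return 1.0

# A correct and an incorrect model sample for "write add(a, b)":
good = "def add(a, b):\n    return a + b"
bad  = "def add(a, b):\n    return a - b"
tests = [lambda ns: ns["add"](2, 3) == 5,
         lambda ns: ns["add"](-1, 1) == 0]

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

There is no way for a sample to "look impressive" to this function: the reward depends only on test outcomes, which is the non-gameability property the bullets above describe.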
RLVR is the shift from “reward the vibe” to “reward the verifiable truth” — and it’s what makes the agentic revolution possible.
This is part of a comprehensive analysis. Read the full analysis on The Business Engineer.