
Understanding why agents work now requires understanding the four-stage training evolution that brought us here.
Stage 1: Pretraining (~2020)
Raw pattern learning from massive text corpora. Broad knowledge but no instruction following. Expensive, data-hungry, produces generalist capabilities. Think: encyclopedic knowledge, no understanding of what you’re asking.
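The "raw pattern learning" objective is next-token prediction. A minimal sketch, with a toy bigram count model standing in for the neural network (all names and the corpus are illustrative):

```python
# Sketch of the pretraining objective: predict the next token, score the
# prediction with negative log-likelihood. A bigram count model stands in
# for the actual neural network here.
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the cat ate".split()

# "Train": count which token follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token_probs(prev):
    """Probability distribution over the next token, given the previous one."""
    counts = following[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

# Pretraining loss = average negative log-likelihood of the true next token.
nll = -sum(math.log(next_token_probs(p)[n]) for p, n in zip(corpus, corpus[1:]))
print(round(nll / (len(corpus) - 1), 3))
```

Nothing in this objective rewards following instructions, which is exactly why a pretrained-only model has knowledge but no sense of what you're asking.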
Stage 2: Supervised Fine-Tuning (~2022)
Learning by imitation — showing correct input-output pairs. The model learns to answer questions and follow directions. Limited to what humans can demonstrate. If the optimal reasoning trace isn’t obvious, SFT can’t teach it.
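"Learning by imitation" means cross-entropy training on demonstrated (prompt, response) pairs, with the loss computed only on the response tokens. A minimal sketch of the label masking (token ids and the `IGNORE` convention are illustrative, in the style of common fine-tuning libraries):

```python
# Sketch of SFT data preparation: the model imitates the demonstrated
# response, and the loss is masked out over the prompt tokens.
IGNORE = -100  # conventional "compute no loss here" label id

def build_labels(prompt_ids, response_ids):
    """Labels for one SFT example: mask the prompt, supervise the response."""
    return [IGNORE] * len(prompt_ids) + list(response_ids)

prompt = [101, 102, 103]   # token ids for "What is 2+2?" (illustrative)
response = [201, 202]      # token ids for "4." (illustrative)
print(build_labels(prompt, response))  # [-100, -100, -100, 201, 202]
```

The limitation follows directly: the model can only be pushed toward response tokens a human actually wrote down, so reasoning strategies no one demonstrated are out of reach.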
Stage 3: RLHF (~2022)
Reinforcement Learning from Human Feedback — humans compare outputs and say “this one is better.” Works for tone, safety, and helpfulness. Bottlenecked by expensive preference labels. Gameable: models learn to produce outputs that look impressive rather than outputs that are correct. The reward signal is noisy because human preferences are inconsistent.
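Those pairwise "this one is better" judgments are typically turned into a reward model with a Bradley-Terry pairwise loss. A minimal sketch (the function name is illustrative):

```python
# Sketch of the RLHF reward-model objective: given a human preference
# "chosen beats rejected", push the chosen output's reward above the
# rejected one's via the Bradley-Terry pairwise loss.
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): low when chosen scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Low loss when the reward model already ranks the pair correctly,
# high loss when it ranks them the wrong way around.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Note what the loss never sees: whether the chosen answer was actually correct — only that a human preferred how it looked. That gap is what makes RLHF gameable.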
Stage 4: RLVR (~2025) — The Breakthrough
Reinforcement Learning from Verifiable Rewards. Instead of “which answer looks better?”, programs verify: “is this answer objectively correct?”
Three breakthrough properties:
- Non-gameable rewards: Code either passes its tests or it doesn’t. Because the signal can’t be fooled, optimization runs can be far longer.
- Emergent reasoning: Models discover their own problem-solving strategies rather than imitating humans.
- High capability per dollar: Compute redirected into RL runs against verifiable rewards becomes the primary capability driver.
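The "code passes tests or doesn't" reward can be sketched concretely. A minimal illustration, assuming a code-generation task scored against unit tests (function names and samples are hypothetical):

```python
# Sketch of a verifiable reward in the RLVR sense: a program, not a human,
# decides whether the model's answer is objectively correct.

def verifiable_reward(candidate_source, tests):
    """Return 1.0 iff the generated code passes every test, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)        # run the generated code
        for test in tests:
            assert test(namespace), "test failed"
    except Exception:
        return 0.0                               # any failure: zero reward
    return 1.0

# A correct and an incorrect model sample for "write add(a, b)":
good = "def add(a, b):\n    return a + b"
bad  = "def add(a, b):\n    return a - b"
tests = [lambda ns: ns["add"](2, 3) == 5,
         lambda ns: ns["add"](-1, 1) == 0]

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

There is no way for a sample to "look impressive" to this function: the reward depends only on test outcomes, which is the non-gameability property the bullets above describe.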
RLVR is the shift from “reward the vibe” to “reward the verifiable truth” — and it’s what makes the agentic revolution possible.
This is part of a comprehensive analysis. Read the full analysis on The Business Engineer.