RLVR and the Verifiability Spectrum: Why Code Fell First and What Falls Next

[Figure: RLVR Verifiability Spectrum, Four Tiers]

To understand why agentic coding is the proving ground, and what determines the sequence of domains that follow, you have to look beneath the product layer at the training paradigm that made it all possible.

The Four-Stage Training Evolution

  • Stage 1 — Pretraining (~2020): Raw pattern learning from massive text corpora. Broad knowledge but no ability to follow instructions or reason reliably.
  • Stage 2 — Supervised Finetuning (~2022): Learning by imitation—showing the model correct input-output pairs. Limited to what humans can demonstrate.
  • Stage 3 — RLHF (~2022): Reinforcement Learning from Human Feedback for tone, safety, helpfulness. Bottlenecked by expensive, noisy human preferences.
  • Stage 4 — RLVR (~2025): Reinforcement Learning from Verifiable Rewards. Instead of “which answer looks better?”, programs and test suites verify: “is this answer objectively correct?”

Why RLVR Was the Breakthrough

Three properties set it apart:

  1. Non-gameable objective rewards. Code either passes the test suite or it doesn’t; the reward signal cannot be flattered or fooled. This permits much longer optimization runs without the reward hacking that plagues learned preference models.
  2. Emergent reasoning strategies. Models spontaneously developed problem-solving approaches—breaking problems into steps, trying multiple approaches, backtracking from dead ends. The model discovers its own strategies rather than imitating human ones.
  3. High capability per dollar. Compute redirected from pretraining into longer RL runs against verifiable rewards became the primary driver of capability progress.

The Verifiability Spectrum: Why Code Fell First

The feasibility of agentic AI in any domain maps directly to where that domain sits on the verifiability spectrum:

Tier 1 — Binary Deterministic Verification (Code, Math): The verifier is a program. The answer is objectively right or wrong. This is where RLVR started and where agents are most mature.
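The Tier 1 case can be made concrete with a minimal sketch of a verifiable reward function: execute a candidate solution against a hidden test suite and return a strictly binary score. The function and test names here (`verify_code`, `TESTS`, a `solve` entry point) are illustrative assumptions, not from any specific RLVR implementation.

```python
# Sketch of a Tier-1 verifiable reward: run a candidate solution
# against a test suite; the reward is binary and non-gameable.
# `verify_code`, `TESTS`, and the `solve` convention are illustrative.

def verify_code(candidate_src: str, tests: list[tuple[tuple, object]]) -> float:
    """Return 1.0 iff the candidate's `solve` passes every test, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # load the candidate code
        solve = namespace["solve"]
        for args, expected in tests:
            if solve(*args) != expected:
                return 0.0                      # any failing test zeroes the reward
    except Exception:
        return 0.0                              # crashes and syntax errors score zero too
    return 1.0

TESTS = [((2, 3), 5), ((-1, 1), 0)]
good = "def solve(a, b):\n    return a + b"
bad  = "def solve(a, b):\n    return a - b"

print(verify_code(good, TESTS))  # 1.0
print(verify_code(bad, TESTS))   # 0.0
```

Because the reward comes from execution rather than from a judge, there is nothing for the policy to flatter: partial credit, stylistic tricks, and confident-sounding wrong answers all collapse to 0.0.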

Tier 2 — Structured Verification with Reference Answers (Medicine, Chemistry, Finance, Law): Correct answers exist but are embedded in unstructured formats, so verification means extracting and normalizing the model’s answer before comparing it to the reference. Early research suggests RLVR extends successfully to medicine, chemistry, psychology, and economics.
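A Tier 2 reward might look like the following sketch: pull a final answer out of free text, normalize it, and compare against a known reference. The `Answer:` convention, the regex, and the normalization rules are all illustrative assumptions, not a standard from the literature.

```python
# Sketch of Tier-2 verification: a reference answer exists, but the
# model's output is free text, so reward = extraction + normalization
# + exact match. The extraction convention here is an assumption.
import re

def reference_reward(model_output: str, reference: str) -> float:
    """1.0 if the final 'Answer:' line matches the reference after
    lowercasing and stripping punctuation and whitespace, else 0.0."""
    match = re.search(r"answer:\s*(.+)", model_output, re.IGNORECASE)
    if not match:
        return 0.0                              # no extractable answer at all
    norm = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return 1.0 if norm(match.group(1)) == norm(reference) else 0.0

out = "The patient shows classic signs.\nAnswer: Type 2 Diabetes"
print(reference_reward(out, "type-2 diabetes"))  # 1.0
```

The hard part in this tier is not the comparison but the normalization: "type-2 diabetes" and "Type 2 Diabetes" must map to the same canonical form, which is exactly the structure-extraction work that separates Tier 2 from Tier 1.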

Tier 3 — Partial Verification with Rubric-Based Rewards (Complex Reasoning, Strategy, Design): No single correct answer, but quality can be decomposed into verifiable sub-criteria. The RLVRR framework transforms single-point supervision into “reward chains.”
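Rubric-based reward can be sketched as a weighted sum over independently checkable sub-criteria. The specific criteria, weights, and keyword checks below are placeholders for illustration; a real system would use far richer checks per rubric item.

```python
# Sketch of a Tier-3 rubric reward: no single correct answer, so the
# reward decomposes quality into weighted, individually verifiable
# sub-criteria. Criteria and weights here are illustrative assumptions.

def rubric_reward(text: str, rubric: list[tuple[str, float, callable]]) -> float:
    """Sum weight * check(text) over (name, weight, check) criteria;
    weights are assumed to sum to 1.0 so the reward stays in [0, 1]."""
    return sum(weight * float(check(text)) for _, weight, check in rubric)

rubric = [
    ("cites evidence",  0.4, lambda t: "because" in t.lower()),
    ("states tradeoff", 0.3, lambda t: "however" in t.lower()),
    ("gives next step", 0.3, lambda t: "recommend" in t.lower()),
]

draft = "We should expand because margins are high; however, churn is a risk."
print(round(rubric_reward(draft, rubric), 2))  # 0.7 (misses a recommendation)
```

Each sub-criterion is cheap to verify on its own, which is what turns a single subjective judgment into a chain of partial rewards the optimizer can climb.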

Tier 4 — Subjective Quality (Creative Work, Brand Voice, Negotiation): No objective verification possible. RLHF remains necessary and agents require the most human oversight.


This is part of a comprehensive analysis. Read the full analysis on The Business Engineer.
