While pretraining scaling consumed the headlines, a quieter revolution was happening in the stages that came after. The production LLM stack stabilized into three layers: pretraining, supervised finetuning (SFT), and RLHF.
The Transformation
SFT turned a raw text predictor into something that could follow instructions. RLHF turned an instruction-follower into something that felt helpful, harmless, and honest. Together, they were the recipe that made ChatGPT possible.
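Mechanically, SFT is just next-token prediction with the loss restricted to the demonstrated response. A toy sketch (the log-probabilities and mask below are made up for illustration):

```python
def sft_loss(token_logprobs, response_mask):
    """Masked SFT loss: average negative log-likelihood over response
    tokens only. Prompt tokens (mask 0) contribute nothing, so the
    model is graded purely on imitating the human demonstration."""
    losses = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(losses) / len(losses)

# Toy sequence: 3 prompt tokens, then 2 response tokens.
logprobs = [-0.1, -0.2, -0.3, -0.5, -0.7]
mask     = [0,    0,    0,    1,    1]   # 1 = response token
print(round(sft_loss(logprobs, mask), 2))  # → 0.6 (averages -0.5, -0.7)
```

Because the target is always a human-written demonstration, the loss can only reward imitation, which is exactly the ceiling discussed below.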
The Escalating Economics
Post-training costs escalated rapidly. Llama 2’s post-training cost $10–20 million; Llama 3.1’s exceeded $50 million, despite using similar volumes of preference data. The increase came not from more data but from more complex pipelines, run by specialized teams of roughly 200 people.
The Structural Ceiling
These stages had fundamental limitations:
- SFT ceiling: The model can never exceed what its human demonstrators showed it
- RLHF ceiling: Models learn to produce outputs that look correct rather than outputs that are correct
- The reward signal is noisy (humans disagree), expensive (every label needs a paid annotator), and subjective
These weren’t fixable problems. They were structural constraints of the paradigm, and they set up the need for Phases 4 and 5.
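The noisy-reward ceiling can be made concrete with the Bradley–Terry loss commonly used to train RLHF reward models (a toy sketch; the scalar rewards are made up):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Trains the reward model to score the preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Clean label: the chosen response scores higher -> small loss.
print(round(preference_loss(2.0, 0.0), 3))  # → 0.127

# Annotator disagreement flips the pair -> large loss, and the
# gradient now pushes the two rewards in the opposite direction.
print(round(preference_loss(0.0, 2.0), 3))  # → 2.127
```

When two annotators disagree on the same pair, the model receives both labels, so the training signal partially cancels; no amount of optimization removes that noise floor.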