From SFT to RLHF: The Thin Layers That Made ChatGPT Possible
While pretraining scaling — as explored in the emerging fifth paradigm of scaling — consumed the headlines, a quieter revolution was happening in the stages that came after. The production LLM stack stabilized into three layers: pretraining, supervised finetuning (SFT), and RLHF.
The Transformation
SFT turned a raw text predictor into something that could follow instructions. RLHF turned an instruction-follower into something that felt helpful, harmless, and honest. Together, they were the recipe that made ChatGPT — as explored in the intelligence factory race between AI labs — possible.
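To make the mechanics concrete, here is a minimal sketch of the SFT objective, assuming PyTorch; the tiny model class, the padding id, and all names are illustrative stand-ins, not any production API. Pretraining and SFT share the same next-token loss; SFT differs in the data (instruction/response pairs) and in masking the prompt, so the model is graded on its answer rather than on the question.

```python
# Minimal SFT sketch (illustrative names, not a real library API).
import torch
import torch.nn.functional as F

PAD_ID = 0

class TinyLM(torch.nn.Module):
    """Stand-in for a pretrained decoder-only language model."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):                      # (batch, seq) -> logits
        return self.head(self.emb(ids))

def sft_loss(model, ids, prompt_lens):
    """Next-token cross-entropy, computed only on response tokens.

    Pretraining uses this same loss over *all* tokens; SFT masks out
    the prompt positions so only the response is supervised.
    """
    logits = model(ids[:, :-1])                  # predict token t+1 from token t
    targets = ids[:, 1:].clone()
    for i, plen in enumerate(prompt_lens):
        targets[i, : plen - 1] = -100            # ignore prompt positions
    targets[ids[:, 1:] == PAD_ID] = -100         # ignore padding
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )

model = TinyLM()
batch = torch.randint(1, 100, (2, 16))           # toy (prompt + response) token ids
loss = sft_loss(model, batch, prompt_lens=[5, 7])
loss.backward()                                  # one SFT gradient step's worth
```

RLHF then adds a reward model trained on human preference pairs and optimizes the SFT model against it, which is where the costs and the ceilings discussed below come in.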
The Escalating Economics
Post-training costs escalated rapidly. Llama 2's post-training cost $10–20 million. Llama 3.1's exceeded $50 million — despite using similar volumes of preference data. The cost increase came from more complex processes requiring specialized teams of ~200 people.
The Structural Ceiling
These stages had fundamental limitations:
SFT ceiling: The model can never exceed what its human demonstrators showed it
RLHF ceiling: Models learn to produce outputs that look correct rather than outputs that are correct
Reward ceiling: The reward signal is noisy (humans disagree), expensive (every label needs a paid annotator), and subjective (see the sketch below)
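A toy calculation makes the noise ceiling concrete. The sketch below assumes the standard Bradley-Terry loss used to train pairwise-preference reward models; the quality scores and the 25% annotator-disagreement rate are invented for illustration.

```python
# Toy demonstration: annotator noise caps reward-model quality.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry objective for a pairwise-preference reward model:
    maximize log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Invented "true" quality scores for 1,000 response pairs.
quality_a = torch.randn(1000)
quality_b = torch.randn(1000)
true_pref_a = quality_a > quality_b          # which response is really better

# Simulated human labels: annotators disagree on ~25% of pairs.
flip = torch.rand(1000) < 0.25
label_a = true_pref_a ^ flip                 # noisy preference labels

# Even a reward model that scores quality perfectly can only agree with
# its training labels up to the annotation noise.
agreement = (true_pref_a == label_a).float().mean()
print(f"perfect scorer vs noisy labels: {agreement.item():.0%} agreement")

# The same noise shows up as irreducible Bradley-Terry loss.
r_chosen = torch.where(label_a, quality_a, quality_b)
r_rejected = torch.where(label_a, quality_b, quality_a)
print(f"BT loss of a perfect scorer: {bt_loss(r_chosen, r_rejected).item():.3f}")
```

Under these assumptions, no amount of compute or modeling skill recovers the ~25% of signal lost to disagreement, which is the sense in which the ceiling is structural rather than an engineering problem.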
These weren't fixable problems. They were structural constraints of the paradigm — and they set up the need for Phases 4 and 5.