In early 2020, OpenAI published the Kaplan et al. scaling-laws paper and, a few months later, GPT-3, together establishing the first quantitative framework for AI capability growth. The thesis was straightforward: test loss falls as a smooth power law in model size, data, and compute, and within a fixed compute budget, model size is the lever that matters most.
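The paper’s central fit is worth writing down. Here is a minimal sketch of the size-only law, assuming the approximate constants reported in Kaplan et al. (α_N ≈ 0.076, N_c ≈ 8.8×10^13 non-embedding parameters); the point is the shape of the curve, not the exact values:

```python
# Sketch of the Kaplan et al. size-only power law, L(N) = (N_c / N) ** alpha_N,
# using the paper's approximate fitted constants (treat the exact values as illustrative).
ALPHA_N = 0.076   # power-law exponent for (non-embedding) parameter count
N_C = 8.8e13      # fitted constant, in parameters

def loss_from_size(n_params: float) -> float:
    """Predicted test loss (nats/token) from model size alone."""
    return (N_C / n_params) ** ALPHA_N

print(f"{loss_from_size(175e9):.2f}")  # ~1.60 at GPT-3 scale under these constants
```

The tiny exponent is the point: cutting loss by a constant factor demands an order-of-magnitude jump in parameters.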
The Numbers
GPT-3 used 175 billion parameters trained on 300 billion tokens — a ratio of roughly 1.7 tokens per parameter. The assumption: model size mattered more than data volume.
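Those headline numbers also pin down GPT-3’s training budget with simple arithmetic. A quick sketch, using the standard C ≈ 6ND approximation for dense-transformer training FLOPs (a rule of thumb, not an exact count):

```python
# Back-of-the-envelope arithmetic for GPT-3: 175B parameters, 300B training tokens.
N = 175e9   # parameters
D = 300e9   # training tokens

tokens_per_param = D / N   # ~1.7
train_flops = 6 * N * D    # ~6 FLOPs per parameter per token for dense transformers

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"training compute:     {train_flops:.2e} FLOPs")  # ~3.15e+23
```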
The industry responded accordingly. The race was to build the biggest model possible within a given compute budget.
What It Got Right
The Kaplan paper showed that loss, and with it capability, falls predictably with compute. GPT-3’s few-shot abilities genuinely surprised researchers and launched the generative AI wave. Scaling was no longer a guess; it was an infrastructure blueprint.
The Critical Blind Spot
The scaling law dramatically undervalued data relative to parameters. Models were large but undertrained. It would take DeepMind’s Chinchilla paper two years later to reveal just how far off the allocation was: for its parameter count, GPT-3 should have seen roughly 11x more training data, or been roughly an order of magnitude smaller for the data it actually saw.
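A back-of-the-envelope check makes the gap concrete. The sketch below uses the widely quoted ~20 tokens-per-parameter rule of thumb, a simplification of the Chinchilla fits rather than the paper’s exact prescription; it lands near 12x in both directions, in the same ballpark as the paper’s more careful estimates:

```python
# Chinchilla-style sanity check, assuming a ~20 tokens-per-parameter compute-optimal ratio.
N = 175e9              # GPT-3 parameters
D = 300e9              # GPT-3 training tokens
TOKENS_PER_PARAM = 20  # rough compute-optimal ratio from Hoffmann et al. (2022)

tokens_needed = TOKENS_PER_PARAM * N   # data a 175B model "wants": ~3.5e12 tokens
optimal_size = D / TOKENS_PER_PARAM    # model size 300B tokens supports: ~15B parameters

print(f"~{tokens_needed / D:.0f}x more data, or ~{N / optimal_size:.0f}x smaller")
```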
The Kaplan result wasn’t just a research finding; it became the template for the largest capital deployment in computing history. Every GPU cluster built and every training run funded in that era followed this blueprint.