DeepMind’s Chinchilla paper was a watershed moment. By training models across a much wider range of sizes and data volumes, it demonstrated that Kaplan’s scaling laws had a systematic bias.
The Corrected Finding
For compute-optimal training, model size and training data should scale in equal proportion: roughly, both grow as the square root of compute. The Chinchilla ratio of approximately 20 tokens per parameter became the new orthodoxy.
The practical implication was stark: the entire previous generation of models (GPT-3, Gopher, PaLM) was severely undertrained.
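The allocation rule is easy to make concrete. Using the standard approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D) and the ~20 tokens-per-parameter heuristic, a compute budget determines both the model size and the token count. A minimal sketch (the budget value below is illustrative, chosen to land near Chinchilla's 70B scale):

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that are compute-optimal for a FLOP budget.

    Uses C ~= 6 * N * D with the Chinchilla heuristic D ~= 20 * N.
    Substituting gives C ~= 120 * N**2, so N = sqrt(C / 120).
    """
    n = math.sqrt(compute_flops / 120.0)
    d = 20.0 * n
    return n, d

# Illustrative budget of ~5.76e23 FLOPs:
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```

Running this recovers roughly 69B parameters and 1.4T tokens, close to the actual Chinchilla configuration (70B parameters, 1.4T tokens).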
Meta’s Response: Deliberate Overtraining
Chinchilla created its own trap: following the law strictly produces compute-optimal models that are too large for efficient inference. Meta's response with Llama was to deliberately overtrain:
- Llama 1 (7B): 142 tokens per parameter
- Llama 2 (7B): 284 tokens per parameter
- Llama 3 (8B): 1,875 tokens per parameter
The loss kept decreasing beyond Chinchilla-optimal ratios, and the resulting models were small enough to deploy economically.
The Small-Model Revolution
Chinchilla enabled what Kaplan never imagined: compact, efficient models that could run anywhere. Smaller models trained on more data could match or exceed the performance of giants at a fraction of the inference cost.
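The trade-off can be sketched with the standard FLOP approximations: training costs about 6·N·D FLOPs, while generating one token costs about 2·N FLOPs (forward pass only). The comparison below uses illustrative sizes, not benchmarked quality-equivalents: an overtrained 8B model versus a 70B model trained at the Chinchilla-optimal ~20 tokens per parameter.

```python
def train_flops(params: float, tokens: float) -> float:
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6.0 * params * tokens

def infer_flops_per_token(params: float) -> float:
    # ~2 FLOPs per parameter per generated token (forward pass only).
    return 2.0 * params

small = 8e9    # 8B params, overtrained on 15T tokens (Llama-3-8B-like)
large = 70e9   # 70B params, trained at ~20 tokens/param (1.4T tokens)

print(f"train small: {train_flops(small, 15e12):.2e} FLOPs")
print(f"train large: {train_flops(large, 1.4e12):.2e} FLOPs")
ratio = infer_flops_per_token(large) / infer_flops_per_token(small)
print(f"per-token inference cost ratio: {ratio:.2f}x")
```

The small model costs somewhat more training compute (7.2e23 vs. 5.88e23 FLOPs here), but every generated token is 8.75x cheaper, and that saving compounds over the model's entire deployed lifetime.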
Data matters as much as parameters. And sometimes, more.