Scaling Laws for Neural Language Models
Kaplan et al. · 2020 · OpenAI Technical Report
Showed that language-model test loss follows smooth, predictable power-law relationships with training compute, dataset size, and parameter count.
Research objective
Characterize how language-model loss scales with model size, dataset size, and compute budget.
Methodology
Trained Transformer language models across a wide range of model sizes, dataset sizes, and compute budgets, with some trends spanning more than seven orders of magnitude, and measured cross-entropy loss on held-out data.
Key findings
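- Loss scales as a power law in each of parameters, data, and compute when not bottlenecked by the other two.
- Larger models are more sample-efficient than smaller ones: they reach the same loss with fewer training tokens.
- The optimal allocation of a fixed compute budget between model size and training data can be predicted in advance.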
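A power law is a straight line in log-log space, which is what makes this kind of extrapolation practical. The sketch below is illustrative only: it fits loss = a * N^(-alpha) to synthetic data by least squares on the logs, then extrapolates one decade further. The constants `TRUE_A` and `TRUE_ALPHA`, the noise level, and the helper `fit_power_law` are assumptions made up for this example, not values or code from the paper.

```python
import math
import random


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space.

    Returns the pair (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical ground truth used only to generate synthetic data.
TRUE_A, TRUE_ALPHA = 10.0, 0.08

rng = random.Random(0)
param_counts = [10 ** k for k in range(3, 10)]  # 1e3 .. 1e9 parameters
losses = [
    TRUE_A * n ** (-TRUE_ALPHA) * math.exp(rng.gauss(0.0, 0.01))  # small noise
    for n in param_counts
]

a_hat, alpha_hat = fit_power_law(param_counts, losses)

# Extrapolate the fitted trend to a model 10x larger than any in the data.
predicted_loss = a_hat * (10 ** 10) ** (-alpha_hat)
print(f"alpha ~ {alpha_hat:.3f}, predicted loss at 1e10 params ~ {predicted_loss:.3f}")
```

Fitting in log space keeps the regression linear and weights each decade of scale equally, which is one common choice for this kind of trend. It assumes the model is not bottlenecked by data or compute over the fitted range.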
Strengths
- Empirical, reproducible, and actionable for capacity planning.
- Catalyzed the strategic decision to invest in ever-larger models.
Limitations
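- Later refined by Chinchilla (Hoffmann et al., 2022), which showed that the compute allocation recommended by Kaplan et al. trains large models on too few tokens.
- Power laws describe smooth trends in loss; they do not predict when specific capabilities emerge.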
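To make the difference between the two allocation rules concrete, the sketch below compares how each would split a 10x increase in compute. The exponents are approximate: roughly 0.73 for parameters in Kaplan et al. and roughly 0.5 for both parameters and tokens in the Chinchilla paper. The token exponent for Kaplan (0.27) is taken here as the remainder under the common assumption that compute is proportional to parameters times tokens. The function name `growth_factors` is hypothetical.

```python
# Approximate scaling exponents: optimal size grows as compute ** exponent.
# Treat these as rounded, illustrative values rather than exact constants.
KAPLAN = {"params": 0.73, "tokens": 0.27}
CHINCHILLA = {"params": 0.50, "tokens": 0.50}


def growth_factors(compute_multiplier, exponents):
    """Return how much each quantity grows when compute grows by the multiplier."""
    return {name: compute_multiplier ** e for name, e in exponents.items()}


kaplan_growth = growth_factors(10.0, KAPLAN)
chinchilla_growth = growth_factors(10.0, CHINCHILLA)

for label, growth in (("Kaplan", kaplan_growth), ("Chinchilla", chinchilla_growth)):
    print(f"{label}: params x{growth['params']:.2f}, tokens x{growth['tokens']:.2f}")
```

Under these assumptions, the Kaplan rule puts most of any extra compute into model size, while the Chinchilla rule grows model size and training tokens by the same factor.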
Practical implications
- Motivated the era of frontier model scaling.
- Set the framework that AGI labs use to forecast capabilities.
Related research
Training Compute-Optimal Large Language Models
Demonstrated that for a fixed compute budget, model size and the number of training tokens should be scaled in roughly equal proportion.
Emergent Abilities of Large Language Models
Argued that certain capabilities appear abruptly above a scale threshold rather than improving smoothly.
