Training Compute-Optimal Large Language Models
Hoffmann et al. · 2022 · DeepMind
Demonstrated that, for a fixed compute budget, model size and the number of training tokens should be scaled in roughly equal proportion.
Research objective
Determine the optimal trade-off between model size and training data given a fixed compute budget.
Methodology
Trained over 400 language models ranging from 70M to over 16B parameters on 5B to 500B tokens, then fitted scaling functions to the results to predict the compute-optimal allocation between model size and training tokens.
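As a rough illustration of what fitting a scaling function can look like, the sketch below fits a power law N_opt = k * C^a by linear regression in log-log space. It is not the paper's fitting procedure. The function name `fit_power_law`, the constants `true_k` and `true_a`, and the data points are hypothetical and synthetic, chosen only to show the mechanics.

```python
import numpy as np


def fit_power_law(compute, optimal_params):
    """Fit N_opt = k * C**a by least squares in log-log space; return (k, a)."""
    # log N = log k + a * log C, so a degree-1 polynomial fit recovers a and log k.
    slope, intercept = np.polyfit(np.log(compute), np.log(optimal_params), 1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, noise-free points generated from a known power law.
# Illustrative values only; these are NOT measurements from the paper.
true_k, true_a = 0.1, 0.5
compute = np.logspace(18, 22, num=9)         # hypothetical FLOP budgets
optimal_params = true_k * compute ** true_a  # hypothetical optimal model sizes

k, a = fit_power_law(compute, optimal_params)
print(f"fitted exponent a = {a:.3f}, coefficient k = {k:.3f}")
```

With real training runs the points would be noisy, and the fitted exponent is the quantity of interest: a value near 0.5 for both parameters and tokens corresponds to scaling the two in equal proportion.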
Key findings
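- Most existing LLMs (including GPT-3 and Gopher) were significantly undertrained for their compute budgets.
- A 70B-parameter model (Chinchilla) trained on 1.4T tokens outperformed much larger models, including Gopher (280B) and GPT-3 (175B), while using the same compute budget as Gopher.
- Compute-optimal scaling: parameters and tokens should grow at similar rates, so each doubling of model size calls for a doubling of training tokens.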
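The equal-scaling finding can be turned into a back-of-the-envelope calculator. The sketch below assumes the common approximation that training compute is C ≈ 6 * N * D FLOPs (N parameters, D tokens), which this summary does not state. It takes the tokens-per-parameter ratio from Chinchilla's own configuration (1.4T tokens / 70B parameters = 20). The function name `compute_optimal_allocation` is hypothetical.

```python
import math

# Ratio implied by Chinchilla's configuration: 1.4T tokens / 70B parameters = 20.
TOKENS_PER_PARAM = 1.4e12 / 70e9
# Assumption (not from this summary): training FLOPs C ≈ 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0


def compute_optimal_allocation(flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) at a fixed tokens-per-parameter ratio."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget implied by Chinchilla's own configuration under the 6ND approximation.
chinchilla_flops = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
params, tokens = compute_optimal_allocation(chinchilla_flops)
print(f"{params:.3g} parameters, {tokens:.3g} tokens")

# Quadrupling compute doubles both parameters and tokens.
params_4x, tokens_4x = compute_optimal_allocation(4 * chinchilla_flops)
print(f"{params_4x / params:.2f}x parameters, {tokens_4x / tokens:.2f}x tokens")
```

Because both quantities grow with the square root of compute under these assumptions, neither parameter count nor data volume dominates as the budget increases.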
Strengths
- Overturned the prevailing intuition that growing parameter count faster than training data is the most efficient use of compute.
- Reduced inference cost by favoring smaller, better-trained models.
Limitations
- Scaling fits target language-modeling loss (perplexity); downstream capabilities may scale differently.
- Data quality and curation effects were not fully isolated.
Practical implications
- Modern frontier models train on trillions of tokens, often at or beyond Chinchilla-style token-to-parameter ratios.
- Shifted strategic focus from parameter count to data scale and quality.
Related research
Scaling Laws for Neural Language Models
Showed that LLM performance follows smooth, predictable power-law relationships with compute, data, and parameters.
Emergent Abilities of Large Language Models
Argued that certain capabilities appear abruptly above a scale threshold rather than improving smoothly.
