
Scaling Laws and Compute
Capabilities of modern AI improve predictably with model size, dataset size, and training compute — a finding with deep implications for research, economics, and policy.
Key facts
- Kaplan et al. (2020) established neural scaling laws across 7+ orders of magnitude.
- Chinchilla (2022) revised optimal model/data scaling to ~20 tokens per parameter.
- Frontier training compute has grown ~4–5x per year since 2010.
- Frontier 2025 training runs are estimated at ~10^26 FLOPs.
- Inference-time scaling emerged as a second productive axis in 2024.
- High-quality public text data is projected to be exhausted between 2026 and 2032.
Kaplan and Chinchilla Laws
Kaplan et al. (OpenAI, 2020) showed that test loss decreases as a smooth power law in compute, parameters, and data — across more than seven orders of magnitude.
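A power law appears as a straight line on log-log axes, which is how such trends are usually fitted. The following is a minimal sketch on synthetic data, not a reproduction of the paper's fits: the coefficient 10 and exponent 0.05 are illustrative values chosen for the demo, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic data (not measurements): loss = 10 * C**(-0.05),
# sampled at 8 points spanning seven orders of magnitude of compute.
compute = [10.0 ** k for k in range(18, 26)]
loss = [10.0 * c ** -0.05 for c in compute]

a, alpha = fit_power_law(compute, loss)
print(f"a = {a:.3f}, alpha = {alpha:.4f}")
```

On real training runs the points scatter around the line, and the fit is only trustworthy inside the range of compute actually measured.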
Chinchilla (Hoffmann et al., DeepMind, 2022) refined this: for a fixed compute budget, optimal performance requires scaling model size and training tokens together, roughly 20 tokens per parameter. Most pre-Chinchilla models were significantly under-trained.
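The 20-tokens-per-parameter rule can be turned into a rough sizing calculation. This sketch assumes the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation is not stated in the text above, and `compute_optimal_split` is a hypothetical helper name.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~= 6 * N * D for dense transformers
TOKENS_PER_PARAM = 20.0      # rough Chinchilla ratio quoted in the text

def compute_optimal_split(compute_flops):
    """Return (params, tokens) satisfying C = 6*N*D and D = 20*N."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

for budget in (1e23, 1e24, 1e25, 1e26):
    n, d = compute_optimal_split(budget)
    print(f"C = {budget:.0e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

As a sanity check, a budget of about 5.9e23 FLOPs gives roughly 70 billion parameters and 1.4 trillion tokens, which matches the published size of the Chinchilla model itself. Under these assumptions both N and D grow with the square root of compute.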
The Compute Trajectory
Compute used for frontier training has grown roughly 4–5x per year since 2010 — far faster than Moore's Law (~1.4x per year). Epoch AI tracks this trend across hundreds of notable training runs.
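To make the gap concrete, annual growth factors can be converted into doubling times and ten-year multipliers. This is plain compounding arithmetic on the figures quoted above; the function names are illustrative.

```python
import math

def doubling_time_months(annual_growth):
    """Months needed to double at a constant annual growth factor."""
    return 12.0 * math.log(2.0) / math.log(annual_growth)

def cumulative_growth(annual_growth, years):
    """Total multiplier after compounding for the given number of years."""
    return annual_growth ** years

rates = (
    ("frontier compute (low)", 4.0),
    ("frontier compute (high)", 5.0),
    ("Moore's Law", 1.4),
)
for label, g in rates:
    print(f"{label:24s} doubles every {doubling_time_months(g):5.1f} months, "
          f"x{cumulative_growth(g, 10):.1e} over 10 years")
```

At 4x per year compute doubles every six months, against roughly two years at 1.4x per year. Over a decade that is the difference between about a million-fold increase and about a thirty-fold one.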
Frontier 2025 training runs are estimated at ~10^26 FLOPs, costing hundreds of millions of dollars. Single-cluster scale has surpassed 100,000 H100-equivalent GPUs (xAI Colossus), with larger builds planned (OpenAI's Stargate).
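These two figures can be related with a back-of-the-envelope estimate of wall-clock training time. The sketch below rests on two assumptions that are not in the text: a per-GPU peak of about 1e15 FLOP/s (rounded from the published dense BF16 peak of an H100) and a sustained utilization of 40%. Real runs vary in both.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days to deliver total_flops at a sustained fraction of peak."""
    sustained = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained / SECONDS_PER_DAY

days = training_days(
    total_flops=1e26,         # frontier-run estimate from the text
    num_gpus=100_000,         # cluster scale from the text
    peak_flops_per_gpu=1e15,  # assumption: ~1 PFLOP/s dense BF16 per GPU
    utilization=0.4,          # assumption: 40% of peak sustained
)
print(f"about {days:.0f} days")
```

Under these assumptions a 10^26 FLOP run fits in about a month on a 100,000-GPU cluster. Halving the utilization doubles the duration.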
Inference-Time Scaling
Since 2024, a second scaling axis has emerged: spending more compute at inference time produces better answers on hard problems. OpenAI's o1/o3 and DeepSeek-R1 show that the accuracy of reasoning-trained models improves predictably as they spend more thinking tokens.
This shifts the economic frontier: inference compute now rivals training compute in importance, and per-query cost can vary by orders of magnitude depending on reasoning depth.
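One simple way to spend more inference compute is to sample several answers and take a majority vote. The sketch below is an idealized model rather than a description of how any particular system works. It assumes each sample is independently correct with the same probability and that the question has two possible answers; real samples are correlated, so actual gains are smaller.

```python
from math import comb

def majority_vote_accuracy(p_correct, num_samples):
    """Probability that more than half of num_samples independent samples are correct."""
    needed = num_samples // 2 + 1
    return sum(
        comb(num_samples, k) * p_correct ** k * (1 - p_correct) ** (num_samples - k)
        for k in range(needed, num_samples + 1)
    )

# Odd sample counts avoid ties. Cost grows linearly with the number of samples.
for n in (1, 5, 25, 101):
    print(f"{n:4d} samples -> accuracy {majority_vote_accuracy(0.6, n):.3f}")
```

With a per-sample accuracy of 0.6, accuracy rises steadily with the sample count while per-query cost rises in proportion. This is the trade-off behind the wide spread in per-query cost.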
Limits to Scaling
Data exhaustion: high-quality public text is finite; Villalobos et al. (Epoch AI, 2024) project that the stock of high-quality language data will be used up sometime between 2026 and 2032. Synthetic data and multimodal sources may extend the runway.
Energy and chips: a single 100K-GPU cluster draws ~150 MW, and US grid expansion and TSMC advanced-node capacity are now binding constraints. Capital is the third limit: frontier labs are raising tens of billions of dollars annually.
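The ~150 MW figure can be sanity-checked from per-GPU power. The inputs below are assumptions, not values from the text: 700 W per GPU chip (the rated power of an H100 SXM module), about 1.3 kW per GPU once host CPUs, memory, and networking are included, and a datacenter overhead factor (PUE) of 1.2.

```python
def cluster_power_mw(num_gpus, watts_per_gpu, pue):
    """Facility power in megawatts: per-GPU power scaled by datacenter overhead (PUE)."""
    return num_gpus * watts_per_gpu * pue / 1e6

# GPU chips alone, no overhead.
gpus_only = cluster_power_mw(100_000, 700, 1.0)

# Assumption: ~1,275 W per GPU including its share of the server, and PUE 1.2.
facility = cluster_power_mw(100_000, 1_275, 1.2)

print(f"GPUs alone: {gpus_only:.0f} MW, whole facility: {facility:.0f} MW")
```

Under these assumptions the chips account for roughly 70 MW and the full facility for a little over 150 MW, consistent with the estimate above.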
Frequently asked
Will scaling alone produce AGI?
Contested. Some researchers expect continued scaling alone to yield AGI; others believe new architectures and learning algorithms will be required. Both camps include serious researchers.
How much compute did training GPT-4 use?
Public estimates suggest ~2x10^25 FLOPs of training compute, roughly what tens of thousands of A100/H100-class GPUs deliver over several months, at a total cost in the high tens of millions of dollars.
What is inference-time scaling?
Using more compute per query (by sampling many candidates, running explicit chain-of-thought, or searching over a tree of partial solutions) to improve answer quality on hard problems. It is now standard in reasoning models.
