Scaling Laws and Compute

Capabilities of modern AI improve predictably with model size, dataset size, and training compute - a finding with deep implications for research, economics, and policy.

Updated April 9, 2026

By Dr. Ira S. Pastor, Editor-in-Chief. Reviewed by BrainMatter Science Review Board.

Key facts

Kaplan et al. (2020) established neural scaling laws across 7+ orders of magnitude.
Chinchilla (2022) revised optimal model/data scaling to ~20 tokens per parameter.
Frontier training compute has grown ~4–5x per year since 2010.
Frontier 2025 training runs are estimated at ~10^26 FLOPs.
Inference-time scaling emerged as a second productive axis in 2024.
High-quality public text data is projected to be exhausted between 2026 and 2032.

Kaplan and Chinchilla Laws

Kaplan et al. (OpenAI, 2020) showed that test loss decreases as a smooth power law in compute, parameters, and data - across more than seven orders of magnitude.
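A power law is a straight line on log-log axes, which is how fits of this kind are usually read. The sketch below generates synthetic loss values from an assumed law of the form loss = a * compute^(-alpha) and recovers the exponent by linear regression in log space. The constants TRUE_A and TRUE_ALPHA are illustrative placeholders, not values reported by Kaplan et al.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-alpha) by least squares on log-log data.

    Returns the pair (a, alpha).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data spanning seven orders of magnitude of compute.
# TRUE_A and TRUE_ALPHA are hypothetical values chosen only for illustration.
TRUE_A, TRUE_ALPHA = 10.0, 0.05
compute = [10.0 ** k for k in range(8)]
loss = [TRUE_A * c ** (-TRUE_ALPHA) for c in compute]

a_hat, alpha_hat = fit_power_law(compute, loss)
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}")
```

On real training curves the points are noisy and the fit is only trusted inside the measured range; extrapolating the line is the step that makes scaling laws useful for planning, and also the step most open to error.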

Chinchilla (Hoffmann et al., DeepMind, 2022) refined this: for a fixed compute budget, optimal performance requires scaling model size and training tokens together, roughly 20 tokens per parameter. Most pre-Chinchilla models were significantly under-trained.
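The 20-tokens-per-parameter rule can be turned into a rough sizing calculation. The sketch below assumes the widely used approximation that training compute is about 6 x parameters x tokens (C ≈ 6ND); that approximation is not stated in this article and is labeled as an assumption in the code. Given a compute budget, it solves for the model size and token count that satisfy both relations.

```python
import math

TOKENS_PER_PARAM = 20.0      # rule of thumb quoted in the text
FLOP_PER_PARAM_TOKEN = 6.0   # assumption: C ~= 6 * N * D

def compute_optimal_split(compute_flop, tokens_per_param=TOKENS_PER_PARAM):
    """Return (params, tokens) with C = 6*N*D and D = tokens_per_param * N."""
    params = math.sqrt(compute_flop / (FLOP_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Example budget of 1e24 FLOP (an arbitrary illustrative value).
params, tokens = compute_optimal_split(1e24)
print(f"{params:.2e} parameters, {tokens:.2e} tokens")
```

Under these assumptions a 10^24 FLOP budget maps to roughly 91 billion parameters trained on about 1.8 trillion tokens. Because both quantities grow with the square root of compute, a 100x larger budget implies a model only 10x larger, trained on 10x more data.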

The Compute Trajectory

Compute used for frontier training has grown roughly 4–5x per year since 2010 - far faster than Moore's Law (~1.4x per year). Epoch AI tracks this trend across hundreds of notable training runs.
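Annual growth factors are easier to compare once converted to doubling times. The short sketch below does that conversion for the two rates quoted above and compounds each over a decade; the function names are illustrative.

```python
import math

def doubling_time_months(growth_per_year):
    """Months needed to double at a constant annual growth factor."""
    return 12.0 * math.log(2.0) / math.log(growth_per_year)

def cumulative_growth(growth_per_year, years):
    """Total multiplicative growth after the given number of years."""
    return growth_per_year ** years

for label, rate in [("frontier compute (4.5x/yr)", 4.5), ("Moore's Law (1.4x/yr)", 1.4)]:
    months = doubling_time_months(rate)
    decade = cumulative_growth(rate, 10)
    print(f"{label}: doubles every {months:.1f} months, {decade:.3g}x per decade")
```

At 4.5x per year compute doubles roughly every five and a half months, against about two years for the Moore's Law rate. Compounded over ten years, the gap is several orders of magnitude.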

Frontier 2025 training runs are estimated at roughly 10^26 FLOPs in total, at a cost of hundreds of millions of dollars each. Single-cluster scale has surpassed 100,000 H100-equivalent GPUs (xAI Colossus, OpenAI/Microsoft Stargate plans).
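These two figures can be connected with a back-of-envelope duration estimate. The sketch below divides total training compute by sustained cluster throughput. The per-GPU peak of 10^15 FLOP/s (a rounded figure for H100-class dense BF16 throughput) and the 40% utilization are assumptions introduced here for illustration, not numbers from this article.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_flop, num_gpus, peak_flop_per_sec, utilization):
    """Wall-clock days to execute total_flop at the given sustained throughput."""
    sustained = num_gpus * peak_flop_per_sec * utilization
    return total_flop / sustained / SECONDS_PER_DAY

H100_PEAK = 1e15     # assumed, rounded peak FLOP/s per GPU
UTILIZATION = 0.4    # assumed fraction of peak actually achieved

days = training_days(1e26, 100_000, H100_PEAK, UTILIZATION)
print(f"about {days:.0f} days")
```

Under these assumptions a 10^26 FLOP run occupies a 100,000-GPU cluster for about a month. Lower utilization or a smaller cluster stretches that proportionally.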

Inference-Time Scaling

Since 2024, a second scaling axis has emerged: spending more compute at inference time produces better answers on hard problems. OpenAI's o1/o3 and DeepSeek-R1 demonstrate that the performance of reasoning-trained models improves predictably as they spend more thinking tokens.

This shifts the economic frontier: inference compute now rivals training compute in importance, and per-query cost can vary by orders of magnitude depending on reasoning depth.
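To see how per-query cost can spread across orders of magnitude, the sketch below prices one query under two reasoning depths. The token counts and per-token prices are hypothetical, and the code assumes reasoning tokens are billed at the output-token rate; actual pricing varies by provider.

```python
def query_cost_usd(prompt_tokens, reasoning_tokens, answer_tokens,
                   usd_per_m_input, usd_per_m_output):
    """Cost of one query, assuming reasoning tokens are billed as output tokens."""
    input_cost = prompt_tokens * usd_per_m_input / 1e6
    output_cost = (reasoning_tokens + answer_tokens) * usd_per_m_output / 1e6
    return input_cost + output_cost

# Hypothetical prices in USD per million tokens.
PRICE_IN, PRICE_OUT = 1.0, 10.0

# Same prompt and answer length; only the reasoning depth differs.
shallow = query_cost_usd(500, 0, 300, PRICE_IN, PRICE_OUT)
deep = query_cost_usd(500, 50_000, 300, PRICE_IN, PRICE_OUT)
print(f"shallow: ${shallow:.4f}  deep: ${deep:.4f}  ratio: {deep / shallow:.0f}x")
```

With these illustrative numbers the deep-reasoning query costs more than 100 times the shallow one, even though the prompt and the visible answer are identical.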

Limits to Scaling

Data exhaustion: high-quality public text is finite; Villalobos et al. (Epoch AI, 2024) project depletion of high-quality language data by 2026–2032. Synthetic data and multimodal sources extend the runway.

Energy, chips, and capital: a single 100K-GPU cluster draws ~150 MW, and US grid expansion and TSMC advanced-node capacity are now binding constraints. Capital availability is the third limit: frontier labs are raising tens of billions of dollars annually.
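The power figure can be unpacked with simple arithmetic. The sketch below derives the implied all-in power per GPU and the energy used over a run; the 90-day run length is a hypothetical value chosen for illustration.

```python
CLUSTER_POWER_MW = 150.0   # figure quoted in the text
NUM_GPUS = 100_000         # figure quoted in the text

def kw_per_gpu(power_mw, num_gpus):
    """All-in facility power per GPU, in kilowatts."""
    return power_mw * 1000.0 / num_gpus

def run_energy_gwh(power_mw, days):
    """Energy consumed at constant power over the given number of days, in GWh."""
    return power_mw * 24.0 * days / 1000.0

RUN_DAYS = 90  # hypothetical run length

print(f"{kw_per_gpu(CLUSTER_POWER_MW, NUM_GPUS):.1f} kW per GPU")
print(f"{run_energy_gwh(CLUSTER_POWER_MW, RUN_DAYS):.0f} GWh over {RUN_DAYS} days")
```

The implied 1.5 kW per GPU is higher than an accelerator's own rated power, which suggests the cluster figure also covers host servers, networking, and cooling.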

Frequently asked

Will scaling alone produce AGI?

Contested. Some researchers expect that continued scaling alone will produce AGI; others believe new architectures and learning algorithms will be required. Both camps include serious researchers.

How much compute did training GPT-4 use?

Public estimates suggest ~2x10^25 FLOPs of training compute, which corresponds to tens of thousands of A100/H100 GPUs running for months, at a total cost in the high tens of millions of dollars.

What is inference-time scaling?

Using more compute per query - by sampling many candidates, running explicit chain-of-thought reasoning, or performing tree search - to improve answer quality on hard problems. Now standard in reasoning models.
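The first of these strategies, sampling many candidates, can be illustrated with a toy majority vote. In the sketch below, noisy_solver is a hypothetical stand-in for one sampled model answer: no real model is called, and the 40% per-sample accuracy is an arbitrary assumption.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer, p_correct):
    """Hypothetical stand-in for one sampled answer from a model."""
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice(["A", "B", "C"])  # one of three distinct wrong answers

def majority_vote(samples):
    """Return the most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]

def accuracy(num_samples, trials=2000, p_correct=0.4, seed=0):
    """Fraction of trials in which the majority answer is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = [noisy_solver(rng, "X", p_correct) for _ in range(num_samples)]
        hits += majority_vote(votes) == "X"
    return hits / trials

for n in (1, 5, 25):
    print(f"{n:>2} samples per query -> accuracy {accuracy(n):.2f}")
```

Because the correct answer is the single most likely output while errors are spread across several alternatives, voting over more samples raises accuracy, at a compute cost that grows linearly with the number of samples.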
