Artificial General Intelligence

Training Compute-Optimal Large Language Models

Hoffmann et al. · 2022 · DeepMind

Demonstrated that for a fixed compute budget, model size and training tokens should be scaled in roughly equal proportion: each doubling of model size calls for a doubling of training tokens.

Research objective

Determine the optimal trade-off between model size and training data given a fixed compute budget.

Trained over 400 language models, ranging from 70M to 16B parameters, on 5B to 500B tokens, then fitted scaling functions to the results to predict the optimal split of compute between model size and training data.
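As a rough illustration of what fitting a scaling function can look like, the sketch below fits a power law y = k * x^a by least squares in log-log space. The helper name `fit_power_law` and the data points are hypothetical: the points are synthetic, generated from a known exponent, and are not results from the paper, whose actual fitting procedure is more involved.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space. Returns (k, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent a.
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    a = sxy / sxx
    # Intercept gives log(k).
    k = math.exp(mean_y - a * mean_x)
    return k, a

# Synthetic points (assumption, NOT data from the paper): best model size
# at each compute budget, generated from N_opt = 0.1 * C**0.5.
budgets = [1e18, 1e19, 1e20, 1e21]
best_sizes = [0.1 * c ** 0.5 for c in budgets]

k, a = fit_power_law(budgets, best_sizes)
print(f"fitted exponent a = {a:.3f}, coefficient k = {k:.3f}")
```

An exponent near 0.5 for model size, with a matching exponent for tokens, is what "scale both equally" means in practice.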

Most LLMs of the time (including GPT-3 and Gopher) were significantly undertrained for their size.
A 70B-parameter model (Chinchilla) trained on 1.4T tokens outperformed much larger models, including the 280B-parameter Gopher trained with the same compute budget.
Optimal scaling: parameters and tokens should grow at similar rates.
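The allocation rule can be sketched with some simple arithmetic. The example below assumes the common approximation that training compute is about 6 × parameters × tokens FLOPs, and a fixed ratio of 20 tokens per parameter, which is simply Chinchilla's own ratio (1.4T / 70B). The function name and the default ratio are illustrative assumptions, not a formula quoted from this summary.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0):
    """Split a FLOP budget between parameters and tokens.

    Assumes C ~= 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)). Both are approximations.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget implied by Chinchilla's reported size and token count.
chinchilla_flops = 6 * 70e9 * 1.4e12
n, d = compute_optimal_allocation(chinchilla_flops)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")

# Quadrupling compute doubles both parameters and tokens under this rule.
n4, d4 = compute_optimal_allocation(4 * chinchilla_flops)
print(f"4x compute: params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")
```

Under these assumptions, parameters and tokens each grow with the square root of compute, which is the "similar rates" behavior described above.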

Later frontier models are typically trained on trillions of tokens, reflecting Chinchilla-style allocation.
Shifted strategic focus from parameter count to data scale and quality.

Scaling Laws for Neural Language Models

Showed that LLM performance follows smooth, predictable power-law relationships with compute, data, and parameters.


Emergent Abilities of Large Language Models

Argued that certain capabilities appear abruptly above a scale threshold rather than improving smoothly.
