Artificial General Intelligence

Scaling Laws for Neural Language Models

Kaplan et al. · 2020 · OpenAI Technical Report

Showed that language-model loss follows smooth, predictable power-law relationships with training compute, dataset size, and parameter count.

Research objective

Characterize how language-model loss scales with model size, dataset size, and compute budget.

The authors trained Transformer language models across a wide range of sizes and compute budgets, with some trends spanning more than seven orders of magnitude, and measured cross-entropy loss on held-out data.
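As a minimal sketch of the quantity being measured, held-out cross-entropy loss is the mean negative log-probability the model assigns to each correct next token. The function name `cross_entropy` and the input format (a list of per-token probabilities) are assumptions made for illustration; they are not taken from the paper.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood, in nats per token.

    token_probs: probabilities the model assigned to the correct next token
    at each position of a held-out sequence (hypothetical input format).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# A model that guesses uniformly over a 4-token vocabulary scores log(4) nats.
uniform_loss = cross_entropy([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model assigns higher probability to the held-out text; a perfect predictor would score 0.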

Loss scales as a power law in each of parameters, data, and compute when it is not bottlenecked by the other two.
Larger models are more sample-efficient than smaller ones.
The optimal allocation of a fixed compute budget between model size and training data can be predicted in advance from the fitted power laws.
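The findings above can be illustrated with a small sketch, assuming the single-variable form L(N) = (N_c / N)^alpha for loss as a function of parameter count N. A power law is a straight line in log-log space, so the exponent can be recovered with ordinary least squares on the logs and then used to extrapolate. The constants `TRUE_N_C` and `TRUE_ALPHA` are placeholders chosen for the demonstration, not values reported in the paper, and the data is synthetic.

```python
import math


def power_law_loss(n, n_c, alpha):
    """Loss predicted by the assumed form L(n) = (n_c / n) ** alpha."""
    return (n_c / n) ** alpha


def fit_power_law(ns, losses):
    """Fit (n_c, alpha) by least squares on log L = alpha*log(n_c) - alpha*log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x   # intercept is alpha * log(n_c)
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return n_c, alpha


# Synthetic, noise-free losses for model sizes from 1e3 to 1e9 parameters.
TRUE_N_C, TRUE_ALPHA = 1e13, 0.08  # placeholder constants, not from the paper
sizes = [10 ** k for k in range(3, 10)]
losses = [power_law_loss(n, TRUE_N_C, TRUE_ALPHA) for n in sizes]

fitted_n_c, fitted_alpha = fit_power_law(sizes, losses)

# Extrapolate to a model ten times larger than any in the fitted range.
predicted_loss = power_law_loss(1e10, fitted_n_c, fitted_alpha)
```

With real measurements the fit would be noisy and would hold only in the regime where neither data nor compute is the bottleneck.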

Training Compute-Optimal Large Language Models

Demonstrated that, for a fixed compute budget, model size and the number of training tokens should be scaled in roughly equal proportion.
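A rough sketch of what equal scaling implies, under two assumptions that are not stated in the summary above: training compute is approximated as C ≈ 6 · N · D FLOPs for N parameters and D tokens, and the tokens-per-parameter ratio is held fixed at about 20, an often-quoted rule of thumb. With the ratio fixed, both N and D grow as the square root of C. The function name `compute_optimal_allocation` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C is approximately 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb ratio, not from the text


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens/param ratio."""
    # Solve C = 6 * N * (r * N) for N, then set D = r * N.
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


small_params, small_tokens = compute_optimal_allocation(1e21)
large_params, large_tokens = compute_optimal_allocation(4e21)
# Quadrupling compute doubles both the parameter count and the token count.
```

The exact ratio depends on the fitted scaling coefficients, so the value 20 should be treated as a placeholder rather than a fixed constant.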

Emergent Abilities of Large Language Models

Argued that certain capabilities appear abruptly above a scale threshold rather than improving smoothly.
