Artificial General Intelligence

Training Compute-Optimal Large Language Models

Hoffmann et al. · 2022 · DeepMind

Demonstrated that, for a fixed compute budget, model size and the number of training tokens should be scaled in roughly equal proportion.

Research objective

Determine the optimal trade-off between model size and training data given a fixed compute budget.

Methodology

Trained over 400 language models ranging from 70M to over 16B parameters on 5B to 500B tokens, then fit scaling functions to predict the compute-optimal allocation between model size and training tokens.
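As a rough illustration of what fitting a scaling function can look like, the sketch below fits a power law y = k * x^a by least squares in log-log space. The helper name `fit_power_law` and the data points are hypothetical and synthetic (generated from a known exponent of 0.5); they are not results from the paper, and the paper's own fitting procedures are more involved.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = k * x**a in log-log space. Returns (k, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least-squares slope and intercept on the log-transformed data.
    s_xx = sum((x - mean_x) ** 2 for x in log_x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    a = s_xy / s_xx
    k = math.exp(mean_y - a * mean_x)
    return k, a


# Synthetic, made-up data: "best model size" at each compute budget,
# generated from N = 0.1 * C**0.5 purely to exercise the fit.
compute_budgets = [1e18, 1e19, 1e20, 1e21]
best_model_sizes = [0.1 * c ** 0.5 for c in compute_budgets]

k, a = fit_power_law(compute_budgets, best_model_sizes)
print(f"fitted exponent a = {a:.3f}, prefactor k = {k:.3f}")
```

An exponent near 0.5 for model size (and likewise for tokens) is what "scale both equally" means in practice: each factor absorbs about half of any increase in log-compute.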

Key findings

  • Most existing LLMs (including GPT-3 and Gopher) were significantly undertrained.
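  • A 70B-parameter model (Chinchilla) trained on 1.4T tokens outperformed much larger models, including Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B).
  • Optimal scaling: parameters and tokens should grow at similar rates, so each doubling of model size calls for a doubling of training tokens.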
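The allocation rule above can be sketched numerically. The example below assumes the common approximation that training compute is C ≈ 6 · N · D FLOPs (N parameters, D tokens) and takes the tokens-per-parameter ratio from the Chinchilla configuration quoted here (1.4T / 70B = 20). Both the function name `compute_optimal_allocation` and the fixed ratio are illustrative assumptions, not the paper's fitted procedure.

```python
import math

# Assumption: training FLOPs are approximated as C = 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0
# Ratio implied by the Chinchilla configuration: 1.4T tokens / 70B parameters.
TOKENS_PER_PARAM = 1.4e12 / 70e9


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens/param ratio.

    With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget implied by a 70B-parameter model trained on 1.4T tokens.
chinchilla_budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n, d = compute_optimal_allocation(chinchilla_budget)
print(f"params = {n:.3e}, tokens = {d:.3e}")

# Quadrupling compute doubles both parameters and tokens under this rule.
n4, d4 = compute_optimal_allocation(4 * chinchilla_budget)
print(f"4x compute: params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")
```

Because both quantities grow with the square root of compute under this rule, a 4x larger budget yields a model twice as large trained on twice as many tokens.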

Strengths

  • Overturned the prevailing intuition that extra compute should go mostly into larger models.
  • Reduced inference cost by favoring smaller, better-trained models.

Limitations

  • Scaling fits focused on language-modeling loss (perplexity); downstream capabilities may scale differently.
  • Data quality and curation effects were not fully isolated.

Practical implications

  • Modern frontier models train on trillions of tokens, matching or exceeding Chinchilla-style allocation.
  • Shifted strategic focus from parameter count to data scale and quality.
