AI benchmark · Knowledge & reasoning

Massive Multitask Language Understanding

Short name: MMLU · Introduced 2020 · Hendrycks et al.

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

What it measures

Breadth of factual knowledge and basic reasoning across academic and professional domains, from elementary mathematics to US foreign policy.

Format

~15,900 four-option multiple-choice questions sourced from real exams (AP, MCAT, bar, GRE, etc.). Models are evaluated in zero-shot and few-shot settings, with the few-shot examples drawn from the same subject as the test question.
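
As a rough illustration of the format described above, the sketch below assembles a zero- or few-shot prompt for one four-option question. The function names, the header wording, and the sample questions are assumptions made for illustration; actual evaluation harnesses differ in their exact templates, and the toy questions are not taken from the dataset.

```python
CHOICE_LABELS = ("A", "B", "C", "D")


def format_question(question, options, answer=None):
    """Render one question; leave the answer blank for the item being tested."""
    if len(options) != len(CHOICE_LABELS):
        raise ValueError("expected exactly four options")
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICE_LABELS, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, examples, target):
    """Build a prompt from solved examples (few-shot) plus one unsolved target.

    examples: list of (question, options, answer_letter); empty for zero-shot.
    target:   (question, options) for the item the model must answer.
    """
    # Header wording is an assumed template, not a fixed standard.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(q, opts, ans) for q, opts, ans in examples]
    blocks.append(format_question(target[0], target[1]))
    return header + "\n\n" + "\n\n".join(blocks)


# Toy one-shot prompt (invented questions, for illustration only).
prompt = build_prompt(
    "elementary mathematics",
    [("What is 2 + 2?", ["3", "4", "5", "6"], "B")],
    ("What is 3 + 3?", ["5", "6", "7", "8"]),
)
print(prompt)
```

Passing an empty `examples` list yields the zero-shot variant. The model's prediction is then typically read off as whichever of the four letters it ranks or generates first after the final "Answer:".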

Scoring

Accuracy averaged across all 57 subjects. Estimated human-expert accuracy is ~89.8%; random guessing yields 25% (one of four options).
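
The averaging step can be sketched as follows, assuming predictions are available as (subject, predicted letter, gold letter) tuples. The record layout and function names are hypothetical. The sketch also computes accuracy pooled over all questions, since the two numbers can differ when subjects have unequal question counts.

```python
from collections import defaultdict


def macro_accuracy(records):
    """Mean of per-subject accuracies; every subject counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    if not total:
        raise ValueError("no records to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    return sum(per_subject.values()) / len(per_subject), per_subject


def micro_accuracy(records):
    """Accuracy pooled over all questions; larger subjects weigh more."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    return sum(pred == gold for _, pred, gold in records) / len(records)


# Toy data: 1 of 2 correct in one subject, 4 of 4 in another.
records = [
    ("astronomy", "A", "A"),
    ("astronomy", "B", "C"),
    ("virology", "D", "D"),
    ("virology", "A", "A"),
    ("virology", "C", "C"),
    ("virology", "B", "B"),
]
macro, per_subject = macro_accuracy(records)
micro = micro_accuracy(records)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

On the toy data the per-subject accuracies are 0.5 and 1.0, so the subject-averaged score is 0.75 while the pooled score is 5/6. Reported results should state which average is used.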

Notable results

  • GPT-3 (2020): 43.9% few-shot; near-random in many subjects.
  • GPT-4 (2023): 86.4% five-shot.
  • Claude 3 Opus and GPT-4o (2024): reported scores above 88%, approaching the estimated human-expert baseline of ~89.8%.

Strengths

  • Broad coverage of academic domains.
  • Reusable, standardized, and widely reported.

Limitations

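  • The multiple-choice format masks reasoning quality: a correct letter says nothing about how the answer was reached.
  • Contamination concerns: test questions may have leaked into training data, which would inflate scores.
  • Saturating: with frontier models clustered just below the human-expert baseline, the benchmark's discriminative power at the top is diminishing.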
