AI benchmark · Knowledge & reasoning
Massive Multitask Language Understanding
Short name: MMLU · Introduced 2020 · Hendrycks et al.
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
What it measures
Breadth of factual knowledge and basic reasoning across academic and professional domains, from elementary mathematics to US foreign policy.
Format
~15,900 four-option multiple-choice questions collected from freely available sources, including practice material for exams such as the GRE and USMLE. Models are evaluated zero-shot or few-shot (commonly five-shot).
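To make the format concrete, here is a minimal sketch of how a zero- or few-shot prompt for one question might be assembled. The `Question` class and the `format_question` and `build_prompt` helpers are hypothetical names chosen for illustration, and the template wording is an assumption modeled on commonly used evaluation scripts; real harnesses differ in the exact layout.

```python
from dataclasses import dataclass
from typing import List, Sequence

CHOICE_LABELS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Question:
    """One four-option item (hypothetical container, not an official schema)."""
    text: str
    choices: Sequence[str]  # exactly four answer options
    answer: str             # gold label, one of "A".."D"


def format_question(q: Question, include_answer: bool) -> str:
    """Render the question, its lettered options, and an answer line."""
    if len(q.choices) != len(CHOICE_LABELS):
        raise ValueError("expected exactly four choices")
    lines = [q.text]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, q.choices)]
    # Solved examples show the gold label; the target leaves it blank.
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject: str, examples: List[Question], target: Question) -> str:
    """Zero-shot when `examples` is empty, k-shot when it holds k solved items."""
    # Header wording is an assumption; adjust to match the harness in use.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [format_question(ex, include_answer=True) for ex in examples]
    blocks.append(format_question(target, include_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

Passing five solved examples from the same subject gives a five-shot prompt; the model's predicted letter is then compared against `target.answer`.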
Scoring
Accuracy averaged across all 57 subjects. Estimated human expert accuracy is ~89.8%; random guessing yields 25%.
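"Averaged across subjects" can be read two ways: the mean of per-subject accuracies (macro) or accuracy over all questions pooled together (micro). Evaluation scripts may report either, so it is worth checking which one a published number uses. The sketch below computes both; the `score` function and its input shape are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def score(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute macro and micro accuracy from (subject, is_correct) pairs."""
    per_subject = defaultdict(list)
    for subject, is_correct in results:
        per_subject[subject].append(bool(is_correct))
    if not per_subject:
        raise ValueError("no results to score")

    # Macro: every subject counts equally, regardless of its question count.
    subject_acc = {s: sum(v) / len(v) for s, v in per_subject.items()}
    macro = sum(subject_acc.values()) / len(subject_acc)

    # Micro: every question counts equally, so large subjects weigh more.
    total = sum(len(v) for v in per_subject.values())
    micro = sum(sum(v) for v in per_subject.values()) / total

    return {"macro": macro, "micro": micro}


# Toy data: 3 of 4 correct in one subject, 0 of 1 in another.
toy = [("algebra", True), ("algebra", True), ("algebra", True),
       ("algebra", False), ("law", False)]
print(score(toy))  # macro is 0.375, micro is 0.6
```

The two figures diverge whenever subjects differ in size and difficulty. Either way, the floor for comparison is 0.25, the expected accuracy of a uniform random guess over four options.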
Notable results
- GPT-3 (2020): 43.9% few-shot - near-random in many subjects.
- GPT-4 (2023): 86.4% five-shot.
- Claude 3 Opus and GPT-4o (2024): reported scores above 88%, approaching the estimated human expert level.
Strengths
- Broad coverage of academic domains.
- Reusable, standardized, and widely reported.
Limitations
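- Multiple-choice format masks reasoning quality: a correct letter can come from guessing or elimination.
- Serious contamination concerns: questions come from public sources and may appear in training data.
- Saturating - frontier models cluster near the top, so discriminative power is diminishing.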
Related entities
Other AI benchmarks
ARC-AGI
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
GPQA
PhD-written hard-science questions designed so non-experts can't solve them even with web search.
HumanEval
164 hand-written Python programming problems evaluated by unit tests.
SWE-Bench
Real GitHub issues from popular Python repos - does the model produce a patch that passes the project's tests?
