AI benchmark · Knowledge & reasoning

Massive Multitask Language Understanding

Short name: MMLU · Introduced 2020 · Hendrycks et al.

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

What it measures

Breadth of factual knowledge and basic reasoning across academic and professional domains, from elementary mathematics to US foreign policy.

About 15,900 four-option multiple-choice questions drawn from exam and practice-test material (AP, MCAT, bar, GRE, etc.). Models are evaluated in zero-shot and few-shot settings.
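To make the zero- and few-shot setup concrete, here is a minimal sketch of how a four-option question might be turned into a prompt. The `Question` class, the `build_prompt` helper, and the exact header wording are illustrative assumptions; evaluation harnesses differ in the template they use.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCD"


@dataclass
class Question:
    """Hypothetical container for one four-option item."""
    text: str
    options: List[str]  # exactly four answer choices
    answer: int         # index of the correct option, 0-3


def format_question(q: Question, include_answer: bool) -> str:
    """Render a question with lettered options and an answer line."""
    lines = [q.text]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(q.options)]
    # Solved examples show the letter; the target question leaves it blank.
    lines.append(f"Answer: {LETTERS[q.answer]}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject: str, shots: List[Question], target: Question) -> str:
    """Few-shot prompt: header, solved examples, then the unsolved target.

    Passing an empty `shots` list gives the zero-shot variant.
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [header]
    blocks += [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)
```

The model's next token after the final "Answer:" is then compared against the gold letter. With five solved examples in `shots` this gives a 5-shot prompt; with none it is zero-shot.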

Accuracy averaged across all 57 subjects. Estimated human expert accuracy is ~89.8%; random guessing yields 25%.
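The per-subject averaging can be sketched as follows. This is an illustrative calculation, not any particular harness's code; the record layout `(subject, predicted, gold)` and the function names are assumptions made for the example.

```python
from collections import defaultdict


def subject_accuracies(records):
    """Return {subject: accuracy} from (subject, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        correct[subject] += int(predicted == gold)
    return {s: correct[s] / total[s] for s in total}


def macro_average(records):
    """Average of per-subject accuracies: every subject counts equally."""
    per_subject = subject_accuracies(records)
    return sum(per_subject.values()) / len(per_subject)


def micro_average(records):
    """Accuracy over all questions pooled: every question counts equally."""
    records = list(records)
    return sum(p == g for _, p, g in records) / len(records)


# Toy data: two subjects with different question counts.
records = [
    ("math", "A", "A"), ("math", "B", "A"), ("math", "C", "C"), ("math", "D", "D"),
    ("law", "A", "B"), ("law", "B", "B"),
]
print(macro_average(records))  # (0.75 + 0.5) / 2 = 0.625
print(micro_average(records))  # 4 correct out of 6 questions
```

Some evaluation setups pool all questions instead (the micro average above). The two figures can differ when subjects have different numbers of questions, so it helps to check which one a reported score uses.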

ARC-AGI

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.

GPQA

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

HumanEval

164 hand-written Python programming problems evaluated by unit tests.

SWE-Bench

Real GitHub issues from popular Python repositories; the model must produce a patch that passes the project's tests.
