AI benchmark · General capability survey

Beyond the Imitation Game Benchmark

Short name: BIG-Bench · Introduced 2022 · Srivastava et al. (450+ contributors)

Collaborative 200+ task benchmark probing diverse, often unusual capabilities.

What it measures

Capability breadth across linguistics, math, common-sense reasoning, social bias, mistranslations, theory of mind, and many novel tasks.

204 tasks contributed by researchers worldwide. BIG-Bench Hard (BBH) is a 23-task subset on which earlier language models failed to outperform the average human rater.

Each task defines its own metric; per-task scores are normalized and aggregated, then compared against human-rater performance.
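Because each task reports a different metric, scores have to be put on a common scale before they can be averaged. The sketch below shows one plausible way to do that: rescale each raw score linearly between a per-task low reference (for example, chance level) and a high reference (a perfect score), then take the mean. The function names, the task names, and all numbers are hypothetical and chosen for illustration only; this is not the benchmark's actual scoring code.

```python
from statistics import mean


def normalize(raw: float, low: float, high: float) -> float:
    """Linearly rescale a raw score so that `low` maps to 0 and `high` to 100."""
    if high == low:
        raise ValueError("low and high reference scores must differ")
    return 100.0 * (raw - low) / (high - low)


def aggregate(results: list[dict]) -> float:
    """Unweighted mean of the normalized per-task scores."""
    return mean(normalize(r["raw"], r["low"], r["high"]) for r in results)


# Hypothetical per-task results (illustrative values, not real benchmark data).
model_results = [
    # 4-way multiple choice: chance accuracy 0.25 is the low reference.
    {"task": "toy_multiple_choice", "raw": 0.55, "low": 0.25, "high": 1.0},
    # Exact-match task: 0.0 is the low reference.
    {"task": "toy_exact_match", "raw": 0.30, "low": 0.0, "high": 1.0},
]

model_score = aggregate(model_results)
print(f"aggregate normalized score: {model_score:.1f}")
```

Applying the same rescaling to human-rater scores on the same tasks puts both on one scale, so a model at chance level lands near 0 regardless of how many answer options a task has.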

Used to document emergent abilities (Wei et al., 2022).
BBH became the standard reasoning subset; chain-of-thought prompting produced large score gains on it.

MMLU

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

ARC-AGI

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.

GPQA

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

HumanEval

164 hand-written Python programming problems evaluated by unit tests.
