AI benchmark · General capability survey
Beyond the Imitation Game Benchmark
Short name: BIG-Bench · Introduced 2022 · Srivastava et al. (450+ contributors)
Collaborative 200+ task benchmark probing diverse, often unusual capabilities.
What it measures
Capability breadth across linguistics, math, common-sense reasoning, social bias, translation-error detection, theory of mind, and many novel tasks.
Format
204 tasks contributed by researchers worldwide. BIG-Bench Hard (BBH) is a 23-task subset on which earlier language models had not outperformed the average human rater.
Scoring
Each task defines its own metric. Per-task scores are normalized so they can be averaged into an aggregate score, which is compared against human-rater baselines.
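A minimal sketch of how a normalized aggregate could be computed. It assumes min-max rescaling of each task's raw score between a low reference (for example, random-chance performance) and a high reference (the maximum attainable score), followed by an unweighted mean. The function names, task names, and numbers are hypothetical and are not taken from the BIG-Bench codebase.

```python
from statistics import mean
from typing import Mapping, Tuple


def normalize(raw: float, low: float, high: float) -> float:
    """Rescale a raw task score so that `low` maps to 0 and `high` maps to 100."""
    if high == low:
        raise ValueError("low and high reference scores must differ")
    return 100.0 * (raw - low) / (high - low)


def aggregate(task_scores: Mapping[str, Tuple[float, float, float]]) -> float:
    """Unweighted mean of normalized scores; each value is (raw, low, high)."""
    return mean(normalize(raw, low, high) for raw, low, high in task_scores.values())


# Hypothetical tasks and numbers, for illustration only.
scores = {
    "four_choice_task": (0.55, 0.25, 1.0),  # accuracy, chance level 0.25
    "binary_task": (0.50, 0.50, 1.0),       # accuracy, chance level 0.50
    "exact_match_task": (0.30, 0.00, 1.0),  # exact match, no chance floor
}

per_task = {name: normalize(*values) for name, values in scores.items()}
overall = aggregate(scores)

for name, value in per_task.items():
    print(f"{name}: {value:.1f}")
print(f"aggregate: {overall:.1f}")
```

Under this assumed scheme, a model at chance level on a task scores 0 regardless of how many answer options the task has, which is what makes tasks with different metrics comparable before averaging.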
Notable results
- Used to document emergent abilities (Wei et al., 2022).
- BBH became a widely used reasoning subset; chain-of-thought prompting produced large gains on it (Suzgun et al., 2022).
Strengths
- Unusually diverse and creative.
- Open contribution model surfaces blind spots.
Limitations
- Task quality varies widely.
- Aggregation hides task-level signal.
Related entities
Other AI benchmarks
MMLU
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
ARC-AGI
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
GPQA
PhD-written hard-science questions designed so non-experts can't solve them even with web search.
HumanEval
164 hand-written Python programming problems evaluated by unit tests.
