
AI benchmark · General capability survey

Beyond the Imitation Game Benchmark

Short name: BIG-Bench · Introduced 2022 · Srivastava et al. (about 450 authors across 132 institutions)

A collaboratively built benchmark of more than 200 tasks that probes diverse, often unusual language-model capabilities.

What it measures

Capability breadth across linguistics, math, common-sense reasoning, social bias, mistranslations, theory of mind, and many novel tasks.

Format

204 tasks contributed by researchers worldwide. BIG-Bench Hard (BBH) is a subset of 23 tasks on which earlier language models did not outperform the average human rater.

Scoring

Each task defines its own metric. Per-task scores are normalized to a common scale and averaged into an aggregate score, which is compared against human-rater baselines.
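
One way to picture the normalization is the minimal sketch below. It assumes min-max scaling to a 0-100 range, where each task declares a low score (for example, chance level) and a high score (the best achievable), and an unweighted mean across tasks. The `TaskResult` type, the task names, and the numbers are hypothetical and are not taken from the benchmark's published results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical container for one task's raw score and its score range."""
    name: str
    raw: float   # score in the task's own metric
    low: float   # assumed floor, e.g. random-chance performance
    high: float  # assumed ceiling, e.g. a perfect score


def normalize(result: TaskResult) -> float:
    """Map a raw score to 0-100, where 0 is the floor and 100 is the ceiling."""
    if result.high == result.low:
        raise ValueError(f"{result.name}: high and low scores must differ")
    return 100.0 * (result.raw - result.low) / (result.high - result.low)


def aggregate(results: list[TaskResult]) -> float:
    """Unweighted mean of the normalized per-task scores (an assumption)."""
    return mean([normalize(r) for r in results])


# Illustrative data only: one task at chance level, one task solved perfectly.
results = [
    TaskResult("four_way_multiple_choice", raw=0.25, low=0.25, high=1.0),
    TaskResult("exact_match_arithmetic", raw=1.0, low=0.0, high=1.0),
]

for r in results:
    print(r.name, normalize(r))
print("aggregate", aggregate(results))
```

In this made-up example the two tasks normalize to 0 and 100, yet the aggregate is 50, so a single averaged number can mask very different per-task behavior.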

Notable results

  • BIG-Bench tasks were used as evidence for emergent abilities in large language models (Wei et al., 2022).
  • BBH became a widely used reasoning subset; chain-of-thought prompting produced large gains on it over answer-only prompting.
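
The difference between answer-only and chain-of-thought prompting can be sketched as follows. This is an illustrative prompt builder only: the `Exemplar` type, the "Q:/A:" layout, the "So the answer is" phrasing, and the example question are assumptions made for this sketch, not the official BBH prompt files.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """Hypothetical few-shot example with a worked rationale."""
    question: str
    rationale: str
    answer: str


def build_prompt(exemplars: list[Exemplar], question: str, chain_of_thought: bool) -> str:
    """Build a few-shot prompt, optionally showing reasoning before each answer."""
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            # The exemplar demonstrates step-by-step reasoning, then the answer.
            target = f"{ex.rationale} So the answer is {ex.answer}."
        else:
            # Answer-only: the exemplar shows just the final answer.
            target = ex.answer
        parts.append(f"Q: {ex.question}\nA: {target}")
    # The new question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Illustrative exemplar, made up for this sketch.
shots = [
    Exemplar(
        question="Is the statement 'not (True and False)' true or false?",
        rationale="'True and False' is False, and 'not False' is True.",
        answer="True",
    )
]

new_question = "Is the statement 'False or not True' true or false?"
direct_prompt = build_prompt(shots, new_question, chain_of_thought=False)
cot_prompt = build_prompt(shots, new_question, chain_of_thought=True)
print(cot_prompt)
```

The two prompts differ only in the exemplar answers, which makes it easy to compare both prompting styles on the same questions.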

Strengths

  • Unusually diverse and creative.
  • Open contribution model surfaces blind spots.

Limitations

  • Task quality varies widely.
  • The aggregate score hides task-level differences.
