Intelligence Benchmark Center

How we measure intelligence - machine and human

A reference center for the evaluations that define progress in AI and the psychometric tasks that define human cognition. Each benchmark page covers what it measures, format, scoring, notable results, and limitations.

AI benchmarks

Knowledge & reasoning

MMLU

Massive Multitask Language Understanding

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.


Fluid reasoning

ARC-AGI

Abstraction and Reasoning Corpus for AGI

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.


Expert reasoning

GPQA

Graduate-Level Google-Proof Q&A

Multiple-choice questions in biology, physics, and chemistry, written by domain experts so that skilled non-experts can't solve them even with web search.


Code generation

HumanEval

164 hand-written Python programming problems evaluated by unit tests.
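As a rough illustration of what "evaluated by unit tests" means, the sketch below runs a candidate solution against a set of assertions and counts it as solved only if every assertion passes. The function name `passes_unit_tests` and the toy `add` problem are hypothetical and not taken from the benchmark; a real harness would also sandbox the code and enforce timeouts, which this sketch does not.

```python
def passes_unit_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds.

    Illustrative only: exec() on untrusted model output is unsafe without a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run the assertions against them
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


# Hypothetical toy problem: the prompt asks for an `add` function.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_unit_tests(good, tests), passes_unit_tests(bad, tests)]
solved_fraction = sum(results) / len(results)  # share of candidates that pass
```

Scoring on functional correctness rather than text similarity means any implementation that satisfies the tests is accepted, however it is written.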


Software engineering

SWE-Bench

Real GitHub issues from popular Python repos - does the model produce a patch that passes the project's tests?


General capability survey

BIG-Bench

Beyond the Imitation Game Benchmark

Collaborative 200+ task benchmark probing diverse, often unusual capabilities.


Human cognition benchmarks

General intelligence (g)

Intelligence Quotient Tests (WAIS, Raven's, Stanford-Binet)

Standardized batteries estimating general cognitive ability (g) relative to a normed population.
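To make "relative to a normed population" concrete, here is a minimal sketch of a deviation score: a raw score is converted to a z-score against the norm group and rescaled. It assumes the common convention of mean 100 and standard deviation 15; the function name `deviation_iq` and the example numbers are hypothetical, and real batteries use age-banded norm tables and composite subtest scores rather than a single formula.

```python
def deviation_iq(raw_score: float, norm_mean: float, norm_sd: float,
                 iq_mean: float = 100.0, iq_sd: float = 15.0) -> float:
    """Rescale a raw score to an IQ-style scale using the norm group's mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw_score - norm_mean) / norm_sd  # distance from the norm mean, in SD units
    return iq_mean + iq_sd * z


# Hypothetical norm group: mean raw score 50, SD 10.
score = deviation_iq(raw_score=60, norm_mean=50, norm_sd=10)  # one SD above the mean
```

Under these assumptions a score one standard deviation above the norm mean maps to 115, and a score at the norm mean maps to 100.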


Working memory

Working Memory

Working Memory Tasks (N-Back, Digit Span, Corsi Block)

Tasks measuring the capacity to hold and manipulate information in mind over seconds.
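As one example of how such a task can be scored, the sketch below marks N-Back targets (a stimulus that matches the one shown n steps earlier) and computes hit and false-alarm rates from a participant's yes/no responses. The function names, the letter sequence, and the responses are all hypothetical; published paradigms differ in stimuli, timing, and the summary statistic they report.

```python
def nback_targets(stimuli, n):
    """Mark each position whose stimulus matches the one presented n steps earlier."""
    return [i >= n and stimuli[i] == stimuli[i - n] for i in range(len(stimuli))]


def hit_and_false_alarm_rates(stimuli, responses, n):
    """Return (hit rate, false-alarm rate) for boolean 'match' responses."""
    targets = nback_targets(stimuli, n)
    hits = sum(t and r for t, r in zip(targets, responses))
    false_alarms = sum((not t) and r for t, r in zip(targets, responses))
    n_targets = sum(targets)
    n_nontargets = len(targets) - n_targets
    hit_rate = hits / n_targets if n_targets else 0.0
    fa_rate = false_alarms / n_nontargets if n_nontargets else 0.0
    return hit_rate, fa_rate


# Hypothetical 2-back run: targets fall at positions 2, 3, and 6.
stimuli = list("ABABCAC")
responses = [False, False, True, False, False, True, True]
hit_rate, fa_rate = hit_and_false_alarm_rates(stimuli, responses, n=2)
```

Reporting hits and false alarms separately distinguishes a participant who tracks the sequence from one who simply answers "match" often.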


Speed of cognition

Processing Speed

Processing Speed (Symbol Search, Digit-Symbol Coding, Reaction Time)

Measures how quickly the brain executes simple cognitive operations.


Cognitive control

Executive Function

Executive Function (Stroop, Trail Making, Wisconsin Card Sorting)

Tasks isolating inhibition, set-shifting, and monitoring - the brain's cognitive control suite.


Attention

Attention (Continuous Performance Test, Attention Network Test, Posner Cueing)

Tasks dissociating alerting, orienting, and executive attention networks.


Flexibility & creativity

Cognitive Flexibility

Cognitive Flexibility (Task Switching, Wisconsin Card Sorting, Alternative Uses)

Tasks measuring the ability to shift mental sets and generate divergent solutions.


AI vs human comparison frameworks

AI vs Human Reasoning Benchmarks

Modern LLMs match or exceed human-expert averages on saturated knowledge benchmarks such as MMLU, but still trail humans on fluid-abstraction tasks such as ARC-AGI that are built to resist memorization.


AI vs Human Coding Benchmarks

AI agents now solve a majority of real software-engineering tasks on SWE-Bench Verified, approaching mid-level engineer throughput on bounded issues but still trailing on novel architecture and ambiguous specifications.
