Intelligence Benchmark Center
How we measure intelligence - machine and human
A reference center for the evaluations used to track progress in AI and the psychometric tasks used to measure human cognition. Each benchmark page covers what the benchmark measures, its format and scoring, notable results, and limitations.
AI benchmarks
Knowledge & reasoning
MMLU
Massive Multitask Language Understanding
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
Fluid reasoning
ARC-AGI
Abstraction and Reasoning Corpus for AGI
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
Expert reasoning
GPQA
Graduate-Level Google-Proof Q&A
Hard-science questions written by PhD-level domain experts, designed so that non-experts rarely answer them correctly even with web search.
Code generation
HumanEval
164 hand-written Python programming problems evaluated by unit tests.
Software engineering
SWE-Bench
Real GitHub issues from popular Python repositories; a model succeeds when its patch passes the project's tests.
General capability survey
BIG-Bench
Beyond the Imitation Game Benchmark
Collaborative benchmark of more than 200 tasks probing diverse, often unusual capabilities.
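Two of the benchmarks above, HumanEval and SWE-Bench, score a model on functional correctness: generated code counts as solved only if it passes tests. A minimal sketch of that idea follows, assuming each problem pairs a candidate solution with assertion-based test code. The names `passes_tests` and `pass_rate` and the sample problem are hypothetical and not part of either benchmark's official harness, which would also sandbox execution and enforce timeouts.

```python
from typing import Iterable, Tuple


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and every assertion in test_src holds.

    Illustration only: exec() on untrusted model output is unsafe; a real
    harness isolates execution and applies time limits.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run the assertions against them
    except Exception:
        # Any error (syntax, runtime, failed assertion) counts as unsolved.
        return False
    return True


def pass_rate(problems: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (candidate, tests) pairs whose candidate passes its tests."""
    results = [passes_tests(candidate, tests) for candidate, tests in problems]
    return sum(results) / len(results) if results else 0.0


# Hypothetical problem: one correct and one buggy candidate for the same tests.
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
GOOD = "def add(a, b):\n    return a + b"
BAD = "def add(a, b):\n    return a - b"

if __name__ == "__main__":
    print(pass_rate([(GOOD, TESTS), (BAD, TESTS)]))
```

The sketch reports a single pass/fail per problem; it does not model repeated sampling or any benchmark-specific metric.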
Human cognition benchmarks
General intelligence (g)
IQ
Intelligence Quotient Tests (WAIS, Raven's, Stanford-Binet)
Standardized batteries estimating general cognitive ability (g) relative to a normed population.
Working memory
Working Memory
Working Memory Tasks (N-Back, Digit Span, Corsi Block)
Tasks measuring the capacity to hold and manipulate information in mind over seconds.
Speed of cognition
Processing Speed
Processing Speed (Symbol Search, Digit-Symbol Coding, Reaction Time)
Measures how quickly the brain executes simple cognitive operations.
Cognitive control
Executive Function
Executive Function (Stroop, Trail Making, Wisconsin Card Sorting)
Tasks isolating inhibition, set-shifting, and monitoring, the core components of cognitive control.
Attention
Attention (Continuous Performance Test, ANT, Posner Cueing)
Tasks dissociating alerting, orienting, and executive attention networks.
Flexibility & creativity
Cognitive Flexibility
Cognitive Flexibility (Task Switching, WCST, Alternative Uses)
Tasks measuring the ability to shift mental sets and generate divergent solutions.
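The IQ entry above describes scores as relative to a normed population. A minimal sketch of that norm-referencing step follows, assuming the common deviation-IQ convention of mean 100 and standard deviation 15. The function name `deviation_iq` and the norm sample are hypothetical; real batteries rely on published, age-banded norm tables rather than a z-score computed from a small sample.

```python
from statistics import mean, stdev
from typing import Sequence


def deviation_iq(raw_score: float, norm_sample: Sequence[float],
                 scale_mean: float = 100.0, scale_sd: float = 15.0) -> float:
    """Map a raw score to a standard score relative to a norm sample."""
    mu = mean(norm_sample)
    sigma = stdev(norm_sample)  # sample standard deviation of the norm group
    if sigma == 0:
        raise ValueError("norm sample has no variance")
    z = (raw_score - mu) / sigma  # distance from the norm mean, in SD units
    return scale_mean + scale_sd * z


# Hypothetical raw scores from a norm group (illustrative, not real test data).
NORM = [38, 42, 45, 47, 50, 50, 53, 55, 58, 62]

if __name__ == "__main__":
    print(round(deviation_iq(50, NORM), 1))
```

A raw score equal to the norm-group mean maps to the scale mean, and scores above or below it move away in proportion to the group's spread.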
AI vs human comparison frameworks
AI vs Human Reasoning Benchmarks
Modern LLMs match or exceed human-expert averages on saturated knowledge benchmarks, while still trailing humans on fluid abstraction tasks built to resist memorization.
AI vs Human Coding Benchmarks
AI agents now solve a majority of real software-engineering tasks on SWE-Bench Verified, approaching mid-level engineer throughput on bounded issues but still trailing on novel architecture and ambiguous specifications.
