
Intelligence Benchmarks Center
Reference explainers for how we measure intelligence — in humans, in AI systems, and at the frontier of AGI evaluation.
Key takeaways
- No single test measures intelligence; benchmarks always measure a slice.
- Human and AI benchmarks measure different things even when they share names.
- AGI evaluation is unsolved and remains the most consequential open problem in the field.
What this center covers
Human intelligence metrics, working-memory measures, the broader cognitive-testing landscape, the major AI benchmarks (MMLU, GPQA, ARC, SWE-bench, Humanity's Last Exam), and why AGI evaluation is harder than any of them.
How to read benchmark scores
Treat any single number as a sample, not a verdict. The construct, the contamination risk, the test-set size, and the population matter as much as the headline accuracy.
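To illustrate why test-set size matters, here is a minimal sketch that puts a confidence interval around a headline accuracy. It assumes each benchmark item is an independent pass/fail trial and uses the Wilson score interval, which is one common choice rather than a method this center prescribes. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an accuracy of correct/total.

    z = 1.96 corresponds to roughly 95% coverage. Assumes items are
    independent pass/fail trials, which real benchmarks only approximate.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: the same 80% headline accuracy on a small and a large test set.
for correct, total in [(160, 200), (8000, 10000)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: plausible accuracy range {low:.3f} to {high:.3f}")
```

The smaller test set yields a much wider range for the same headline number. The interval covers sampling noise only; it says nothing about contamination, construct validity, or the population being tested.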
Frequently asked questions
Why so many benchmarks?
Because intelligence is multi-dimensional. Each benchmark targets one slice; no benchmark covers every dimension.
Are AI benchmarks comparable to IQ tests?
Only loosely. IQ tests were designed for humans and rely on assumptions about test-taker behaviour and prior exposure that don't apply to AI.
Continue exploring
Human intelligence metrics
Working memory metrics
Cognitive testing
NeuroAI Center
Architectures and theories behind the systems being benchmarked.
Brain Economy hub
Why these measurements matter for the wider Brain Economy.
Future Intelligence Timeline
How benchmark progress maps onto the 2026–2050 outlook.
