Intelligence Benchmarks Center

Reference explainers on how we measure intelligence: in humans, in AI systems, and at the frontier of AGI evaluation.

Key takeaways

No single test measures intelligence; benchmarks always measure a slice.
Human and AI benchmarks measure different things even when they share names.
AGI evaluation is unsolved and remains the most consequential open problem in the field.

What this center covers

Human intelligence metrics, working-memory measures, the broader cognitive-testing landscape, the major AI benchmarks (MMLU, GPQA, ARC, SWE-Bench, Humanity's Last Exam), and why AGI evaluation is harder than any of them.

How to read benchmark scores

Treat any single number as a sample, not a verdict. The construct being measured, the risk of test-set contamination, the test-set size, and the population tested matter as much as the headline accuracy.
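To make the point about test-set size concrete, here is a minimal sketch that puts a 95% Wilson score interval around a reported accuracy. The function name `wilson_interval` and the example counts are illustrative assumptions, not figures from any real benchmark, and the interval captures sampling noise only; it says nothing about construct validity or contamination.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total                      # headline accuracy
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Hypothetical example: the same 80% headline accuracy on two test-set sizes.
for correct, total in [(160, 200), (8000, 10000)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: accuracy {correct / total:.1%}, "
          f"95% interval {low:.1%} to {high:.1%}")
```

The smaller test set yields a much wider interval, so a difference of a few points between two systems on a 200-item benchmark may be indistinguishable from noise.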

Frequently asked questions

Why so many benchmarks?

Because intelligence is multi-dimensional. Each benchmark targets one slice; no benchmark covers every dimension.

Are AI benchmarks comparable to IQ tests?

Only loosely. IQ tests were designed for humans and rely on assumptions about test-taker behaviour and prior exposure that don't apply to AI.

Continue exploring

Human intelligence metrics

Working memory metrics

Cognitive testing

NeuroAI Center

Architectures and theories behind the systems being benchmarked.

Brain Economy hub

Why these measurements matter for the wider Brain Economy.

Future Intelligence Timeline

How benchmark progress maps onto the 2026–2050 outlook.