AI benchmarks: MMLU, GPQA, ARC, SWE-Bench, HLE

The most-cited 2026 AI benchmarks and what each actually measures.

Key takeaways

MMLU tests broad multi-task knowledge; frontier models now saturate it.
GPQA targets graduate-level science questions resistant to web search.
ARC-AGI 2 tests abstract reasoning on novel grid puzzles.
SWE-Bench measures software-engineering capability by asking models to resolve real GitHub issues.
Humanity's Last Exam (HLE) aggregates very hard expert-written questions across disciplines.

Comparison at a glance

MMLU: breadth, near-saturation.
GPQA: depth in science.
ARC: novelty and abstraction.
SWE-Bench: real-world software work.
Humanity's Last Exam: expert-level multi-domain reasoning.

Different shapes of difficulty, different signal.

Contamination and saturation

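Older benchmarks (MMLU, HumanEval) increasingly suffer from training-data contamination: their test items have been public long enough to turn up in training corpora, which can inflate scores. Newer benchmarks (GPQA Diamond, ARC-AGI 2, HLE) are designed to resist it. Saturation is a separate problem: once frontier models score near the ceiling, as they now do on MMLU, the benchmark no longer separates them.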
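One common heuristic for spotting contamination is to measure how many of a benchmark item's word n-grams also appear verbatim in a training document. The sketch below illustrates that idea only; the function names, the whitespace tokenizer, and the n-gram length are assumptions made for this example, not the procedure used by any benchmark named here.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenizer)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_doc, n=8):
    """Fraction of the item's n-grams that also occur in corpus_doc.

    The default n=8 is an illustrative assumption; items shorter than
    n tokens have no n-grams and score 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    doc_grams = ngrams(corpus_doc, n)
    return len(item_grams & doc_grams) / len(item_grams)


# Hypothetical benchmark item and two hypothetical corpus documents.
item = "what is the capital of france and why is it famous"
leaked_doc = "quiz dump: " + item + " answer below"
clean_doc = "a short essay about cooking pasta at home on a weeknight"

print(overlap_ratio(item, leaked_doc, n=5))
print(overlap_ratio(item, clean_doc, n=5))
```

A ratio near 1 suggests the item appears almost verbatim in the document. A low ratio does not prove the item is clean, since paraphrased or translated copies escape exact n-gram matching.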

Frequently asked questions

Why do scores vary by 10+ points across providers?

Prompt phrasing, use of chain-of-thought, scaffolding, and the evaluation harness all affect the reported score. Apples-to-apples comparisons require the same harness and the same settings.
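To make the harness effect concrete, here is a minimal sketch in which the same model outputs are scored with two different answer-extraction rules. The outputs, gold labels, and both extractor functions are hypothetical and written for this example; real harnesses differ in many more ways than answer parsing.

```python
import re


def extract_strict(output):
    """Accept only outputs that are exactly one letter A-D."""
    text = output.strip()
    return text if text in {"A", "B", "C", "D"} else None


def extract_lenient(output):
    """Accept the last standalone letter A-D found anywhere in the output."""
    matches = re.findall(r"\b([A-D])\b", output)
    return matches[-1] if matches else None


def accuracy(outputs, gold, extractor):
    """Share of outputs whose extracted answer equals the gold label."""
    correct = sum(extractor(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Hypothetical multiple-choice outputs from one model, with gold labels.
outputs = [
    "B",
    "The answer is C.",
    "Let me think step by step. So the answer is A",
    "D",
]
gold = ["B", "C", "A", "D"]

print(accuracy(outputs, gold, extract_strict))   # 0.5
print(accuracy(outputs, gold, extract_lenient))  # 1.0
```

The model's answers are identical in both runs; only the parsing rule changes, and the score moves from 0.5 to 1.0. This is the kind of gap that can separate two providers' reported numbers for the same model.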

Are benchmark scores predictive of real-world use?

Partially. Scores correlate with real-world performance, but workflow design and reliability often matter more than raw benchmark numbers.
