
AI benchmarks: MMLU, GPQA, ARC, SWE-Bench, HLE
The most-cited 2026 AI benchmarks and what each actually measures.
Key takeaways
- MMLU tests broad multiple-choice knowledge across 57 subjects; frontier models now score near its ceiling.
- GPQA targets graduate-level science questions (biology, physics, chemistry) written to resist lookup by web search.
- ARC-AGI 2 tests abstract reasoning on novel grid puzzles.
- SWE-Bench measures real software-engineering capability: a model must resolve actual GitHub issues with a patch that passes the repository's tests.
- Humanity's Last Exam aggregates very hard expert questions across disciplines.
Comparison at a glance
- MMLU: breadth, near-saturation.
- GPQA: depth in science.
- ARC: novelty and abstraction.
- SWE-Bench: real-world software work.
- Humanity's Last Exam: expert-level multi-domain reasoning.
Different shapes of difficulty, different signal.
Contamination and saturation
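Older benchmarks (MMLU, HumanEval) increasingly suffer from training-data contamination: their test items have circulated online long enough to end up in training corpora, which inflates scores. MMLU is also close to saturation, so it leaves little room to separate frontier models. Newer benchmarks (GPQA Diamond, ARC-AGI 2, HLE) are designed to resist contamination.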
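One common way to screen for contamination is to check how many of a benchmark item's word n-grams also appear in a sample of training text. The sketch below is a minimal illustration of that idea, not any benchmark's official procedure; the function names, the whitespace tokenization, and the default `n=8` are all assumptions chosen for the example.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also occur somewhere in corpus_docs.

    Hypothetical helper: a ratio near 1.0 suggests the item appears almost
    verbatim in the corpus; a ratio near 0.0 suggests it does not.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A check like this only catches near-verbatim leakage. Paraphrased or translated copies of a question would pass it, which is one reason newer benchmarks rely on held-out or freshly written items instead.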
Frequently asked questions
Why do scores vary by 10+ points across providers?
Prompt phrasing, use of chain-of-thought, agent scaffolding, and the evaluation harness all affect the reported score. Apples-to-apples comparisons require the same harness and the same prompts.
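To see how much the harness alone can move a number, consider two answer-extraction rules applied to the same outputs. This is a toy sketch: the extraction functions are hypothetical, and the output strings are made up for illustration rather than taken from any real model or benchmark.

```python
import re


def strict_extract(output):
    """Accept only outputs that are exactly one letter A-D."""
    text = output.strip()
    return text if text in {"A", "B", "C", "D"} else None


def lenient_extract(output):
    """Accept the last standalone capital letter A-D anywhere in the output."""
    matches = re.findall(r"\b([A-D])\b", output)
    return matches[-1] if matches else None


def accuracy(outputs, gold, extract):
    """Share of outputs whose extracted answer equals the gold label."""
    correct = sum(extract(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Invented example outputs and gold labels, for illustration only.
outputs = ["B", "The answer is C.", "Let me think... so (A)", "D"]
gold = ["B", "C", "A", "B"]

print(accuracy(outputs, gold, strict_extract))   # 0.25
print(accuracy(outputs, gold, lenient_extract))  # 0.75
```

The underlying answers are identical in both runs; only the scoring rule changed. Differences of this kind are one reason a score from one harness cannot be compared directly with a score from another.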
Are benchmark scores predictive of real-world use?
Partially. They correlate, but workflow design and reliability often matter more than raw benchmark numbers.
