Comparison framework
AI vs Human Reasoning Benchmarks
Modern LLMs match or exceed human-expert averages on saturated knowledge benchmarks, while still trailing humans on fluid abstraction tasks built to resist memorization.
Side-by-side
| Dimension | AI benchmark | Human benchmark | Insight |
|---|---|---|---|
| Broad knowledge | MMLU | Crystallized intelligence (Gc) | Frontier LLMs nearly saturate MMLU (~88%), comparable to broadly educated adults on Gc subtests, but they get there through training-set exposure rather than lived experience. |
| Expert reasoning | GPQA Diamond | Domain-expert testing | o-series models now exceed the ~65% domain-expert baseline; the limiting factor on this benchmark is specialized domain knowledge rather than general ability. |
| Fluid abstraction | ARC-AGI | Raven's Progressive Matrices | Humans average ~80% on ARC-AGI; baseline LLMs <30%. High-compute o3 closes most of the gap at enormous inference cost. |
| Working capacity | Long-context retrieval (needle-in-haystack) | Digit span / N-back | AI "working memory" (the context window) now exceeds human limits by orders of magnitude, but it is shallower: humans manipulate what they hold in mind, while models mostly retrieve. |
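
The phrase "closes most of the gap" in the fluid-abstraction row can be read as the fraction of the human-versus-baseline gap that a model recovers. The sketch below makes that arithmetic explicit. It assumes the table's approximate figures (~80% for humans, 30% as an upper bound for baseline LLMs) and a purely hypothetical model score of 75% that is not a reported result; `gap_closed` is an illustrative helper, not a standard metric.

```python
def gap_closed(baseline: float, model: float, human: float) -> float:
    """Fraction of the human-minus-baseline gap that the model recovers."""
    if human == baseline:
        raise ValueError("human and baseline scores must differ")
    return (model - baseline) / (human - baseline)


HUMAN_AVG = 80.0           # table: humans average ~80% on ARC-AGI
BASELINE_LLM = 30.0        # table: baseline LLMs score below 30% (upper bound)
HYPOTHETICAL_MODEL = 75.0  # assumed for illustration only, not a reported score

fraction = gap_closed(BASELINE_LLM, HYPOTHETICAL_MODEL, HUMAN_AVG)
print(f"Fraction of the human gap closed: {fraction:.0%}")
```

A value above 1.0 would mean the model exceeds the human average. Because the baseline figure is an upper bound, the true fraction for any real model could differ from what this sketch prints.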
Important caveats
- Benchmark contamination (test items leaking into training data) distorts AI scores; human baselines are not exposed to this risk.
- Performance ≠ understanding. Matching a score does not imply matching the underlying cognitive process.
- Human ceiling effects on saturated benchmarks make percentile comparisons misleading.
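
The ceiling-effect caveat can be illustrated with a small sketch. It assumes two made-up human score distributions, one clustered near the ceiling and one spread evenly, and a simple `percentile_rank` helper defined here for illustration. None of the numbers are measured data.

```python
from bisect import bisect_right


def percentile_rank(score, population):
    """Percentage of the population scoring at or below `score`."""
    ordered = sorted(population)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical human scores (percent correct) on a saturated benchmark:
# most test-takers cluster within a few points of the ceiling.
saturated = [88, 90, 92, 93, 94, 95, 95, 96, 96, 97,
             97, 98, 98, 99, 99, 100, 100, 100, 100, 100]

# Hypothetical scores spread evenly from 5 to 100 (same sample size).
unsaturated = list(range(5, 101, 5))

for label, population in (("saturated", saturated), ("unsaturated", unsaturated)):
    low = percentile_rank(94, population)
    high = percentile_rank(98, population)
    print(f"{label}: raw 94 -> P{low:.0f}, raw 98 -> P{high:.0f}, "
          f"shift = {high - low:.0f} percentile points")
```

Under these assumed distributions, the same four-point difference in raw score moves a test-taker much further in percentile terms on the saturated benchmark than on the evenly spread one, which is why percentile comparisons near the ceiling can exaggerate small differences.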
