Comparison framework

AI vs Human Reasoning Benchmarks

Modern LLMs match or exceed human-expert averages on saturated knowledge benchmarks, while still trailing humans on fluid abstraction tasks built to resist memorization.

Side-by-side

Dimension	AI benchmark	Human benchmark	Insight
Broad knowledge	MMLU	Crystallized intelligence (Gc)	Frontier LLMs near-saturate MMLU (~88%), comparable to broadly educated adults on Gc subtests, but they get there through training-set exposure rather than lived experience.
Expert reasoning	GPQA Diamond	Domain-expert testing	o-series models now exceed the ~65% domain-expert baseline; specialization, not generality, is the bottleneck.
Fluid abstraction	ARC-AGI	Raven's Progressive Matrices	Humans average ~80% on ARC-AGI; baseline LLMs <30%. High-compute o3 closes most of the gap at enormous inference cost.
Working capacity	Long-context retrieval (needle-in-haystack)	Digit span / N-back	AI 'working memory' (the context window) now exceeds human capacity limits by orders of magnitude, but it is used more shallowly: humans manipulate what they hold, while models mostly retrieve it.

Important caveats

Benchmark contamination (test items leaking into training data) can inflate AI scores; human baselines do not carry this risk.
Performance ≠ understanding. Matching a score does not imply matching the underlying cognitive process.
Human ceiling effects on saturated benchmarks make percentile comparisons misleading.
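
The ceiling-effect caveat can be illustrated with a small sketch. The two score lists below are invented for illustration and are not real human data; `percentile_rank` is a hypothetical helper, not part of any benchmark toolkit. The point is only that when human scores cluster near the top, a small difference in raw score can translate into a large difference in percentile.

```python
def percentile_rank(scores, x):
    """Percentage of scores that are less than or equal to x."""
    return 100.0 * sum(s <= x for s in scores) / len(scores)

# Hypothetical human scores (percent correct), made up for this example.
saturated = [70, 80, 85, 87, 88, 89, 89, 90, 90, 90]  # clustered near the ceiling
spread = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]    # evenly spread out

for name, scores in [("saturated", saturated), ("spread", spread)]:
    low = percentile_rank(scores, 88)
    high = percentile_rank(scores, 90)
    print(f"{name}: score 88 -> percentile {low:.0f}, score 90 -> percentile {high:.0f}")
```

With these made-up numbers, a two-point score difference moves the percentile rank from 50 to 100 on the clustered sample, but only from 80 to 90 on the spread one.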

Referenced benchmarks

AI · mmlu
AI · gpqa
AI · arc-agi
Human · iq
Human · working-memory

Other comparison frameworks

AI vs Human Coding Benchmarks

AI agents now solve a majority of real software-engineering tasks on SWE-Bench Verified, approaching mid-level engineer throughput on bounded issues but still trailing on novel architecture and ambiguous specifications.
