AI benchmark · Expert reasoning

Graduate-Level Google-Proof Q&A

Short name: GPQA · Introduced 2023 · Rein et al.

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

What it measures

Deep domain reasoning in biology, chemistry, and physics at the graduate level, using questions built so that looking the answer up online is not enough.

Format

448 four-option multiple-choice questions written by domain PhDs, validated by other PhDs, and tested on skilled non-experts, who answered only about a third correctly despite unrestricted web access.

Scoring

Accuracy on the 'Diamond' subset (the 198 highest-quality items). Reference baselines reported by the authors: domain experts ~65%, skilled non-experts ~34%. Random guessing on four options gives 25%.
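
The scoring rule is plain accuracy: the share of items where the chosen option matches the answer key. A minimal sketch in Python, assuming answers are stored as option letters A-D; the function name `accuracy` and the toy data are hypothetical and not taken from any official evaluation harness:

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of items where the predicted option letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty item set")
    # Compare letters case-insensitively and ignore stray whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: five items, option letters A-D (not real GPQA items).
gold = ["A", "C", "B", "D", "A"]
pred = ["A", "C", "D", "D", "b"]
score = accuracy(pred, gold)
print(f"accuracy = {score:.1%}")  # prints "accuracy = 60.0%"
```

In practice the harder part is extracting a single option letter from free-form model output. This sketch assumes that step has already been done.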

Notable results

GPT-4 (2023): ~39%.
Claude 3.5 Sonnet (2024): 59.4%.
o1 and o3 series: >75%, above the ~65% domain-expert baseline.

Strengths

Hard to game with retrieval.
Replaces saturated knowledge benchmarks at the frontier.

Limitations

Still multiple choice.
Small total item count limits statistical power.
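
To make the statistical-power point concrete, the sketch below estimates the sampling uncertainty of an accuracy score on 198 items. It assumes items behave like independent Bernoulli trials and uses the normal approximation to the binomial. The function name `accuracy_interval` is hypothetical, and the 59.4% input is simply the Claude 3.5 Sonnet figure listed above, used for illustration:

```python
import math
from typing import Tuple


def accuracy_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (standard_error, lower, upper) for an observed accuracy.

    Assumes independent items and a normal approximation;
    z = 1.96 corresponds to a ~95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return se, lower, upper


# 198 items is the Diamond subset size; 0.594 is an illustrative score.
se, lo, hi = accuracy_interval(0.594, 198)
print(f"standard error = {se:.3f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the standard error is about 3.5 percentage points, so the 95% interval spans roughly ±7 points. Score differences of a few points between models on Diamond can therefore fall within sampling noise.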

Other AI benchmarks

MMLU

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

ARC-AGI

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.

HumanEval

164 hand-written Python programming problems evaluated by unit tests.

SWE-Bench

Real GitHub issues from popular Python repos, scored on whether the model's patch passes the project's tests.
