AI benchmark · Expert reasoning
Graduate-Level Google-Proof Q&A
Short name: GPQA · Introduced 2023 · Rein et al.
PhD-written hard-science questions designed so non-experts can't solve them even with web search.
What it measures
Deep domain reasoning in biology, chemistry, and physics at the graduate level, with questions built so that looking up the answer is not a viable shortcut.
Format
448 multiple-choice questions, each with four answer options, written by domain PhDs and validated by other PhDs. Skilled non-experts answer most of them incorrectly even with unrestricted web access.
Scoring
Accuracy, usually reported on the 'Diamond' subset (198 highest-quality items). Human baselines reported in the original paper: domain experts ~65%, skilled non-experts ~34%.
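A minimal sketch of the scoring step, assuming each prediction and each gold answer is a single option letter (A to D). The function name `accuracy` and the letter format are illustrative assumptions, not part of any official GPQA harness.

```python
from typing import Sequence

# With four answer options, uniform random guessing scores 25%.
CHANCE_ACCURACY = 1 / 4


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of items where the predicted letter matches the gold letter.

    Hypothetical helper: assumes one letter per item; comparison ignores
    surrounding whitespace and letter case.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty item set")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    preds = ["A", "b", "C", "D"]
    gold = ["A", "B", "D", "D"]
    score = accuracy(preds, gold)
    print(f"accuracy={score:.2f}, chance={CHANCE_ACCURACY:.2f}")
```

Reporting the score next to the 25% chance level makes it easier to see how far a model sits above guessing.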
Notable results
- GPT-4 (2023): ~39%.
- Claude 3.5 Sonnet (2024): 59.4%.
- o1 and o3 series: >75%, above the ~65% domain-expert baseline.
Strengths
- Hard to game with retrieval.
- Offers headroom at the frontier, where broad knowledge benchmarks have saturated.
Limitations
- Still multiple choice: with four options, random guessing already scores 25%.
- Small total item count limits statistical power.
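To make the statistical-power limitation concrete, the sketch below estimates the sampling uncertainty of an accuracy score on 198 items. It uses the normal-approximation (Wald) interval for a binomial proportion, which is a simplifying assumption; the function name `wald_interval` is illustrative.

```python
import math


def wald_interval(acc: float, n_items: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an accuracy score.

    Uses the normal approximation to the binomial; treat it as a rough
    guide, since it is less reliable near 0% or 100% accuracy.
    """
    if not 0.0 <= acc <= 1.0:
        raise ValueError("acc must be between 0 and 1")
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    std_err = math.sqrt(acc * (1.0 - acc) / n_items)
    half_width = z * std_err
    # Clamp to the valid range for a proportion.
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


if __name__ == "__main__":
    low, high = wald_interval(0.594, 198)
    print(f"95% CI on 198 items: {low:.3f} to {high:.3f}")
```

For a score near 59% on the 198-item Diamond subset, the interval spans roughly 7 percentage points on either side, so gaps of a few points between models may not be meaningful.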
Related entities
Other AI benchmarks
MMLU
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
ARC-AGI
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
HumanEval
164 hand-written Python programming problems evaluated by unit tests.
SWE-Bench
Real GitHub issues from popular Python repos; the model must produce a patch that passes the project's tests.
