AI benchmark · Expert reasoning

Graduate-Level Google-Proof Q&A

Short name: GPQA · Introduced 2023 · Rein et al.

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

What it measures

Deep domain reasoning in biology, chemistry, and physics at the graduate level, with questions built so that looking up the answer is not a workable shortcut.

Format

448 four-option multiple-choice questions written by domain PhDs, validated by other PhDs, and checked against skilled non-experts, who mostly fail them even with unrestricted web access.

Scoring

Accuracy on the 'Diamond' subset (198 highest-quality items). Reported human baselines: domain experts ~65%; skilled non-experts ~34%. Random guessing gives 25%.
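A minimal sketch of the scoring rule described above: plain accuracy over multiple-choice answers, compared with a chance baseline. The function name, the letter-coded answers, and the toy data are hypothetical, and the 25% chance figure assumes four answer options per question.

```python
from typing import Sequence

# Assumption: four answer options per question, so random guessing scores 25%.
CHANCE_ACCURACY = 0.25


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions where the prediction matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Hypothetical toy data: five made-up items, not real GPQA questions.
toy_answers = ["A", "C", "B", "D", "A"]
toy_predictions = ["A", "C", "D", "D", "B"]

toy_score = accuracy(toy_predictions, toy_answers)
print(f"accuracy = {toy_score:.2f} (chance = {CHANCE_ACCURACY:.2f})")
```

Reporting the chance baseline next to the score makes it easier to see how much of a result reflects more than guessing.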

Notable results

  • GPT-4 (2023): ~39%.
  • Claude 3.5 Sonnet (2024): 59.4%.
  • o1 and o3 series: >75%, above the ~65% domain-expert baseline.

Strengths

  • Hard to game with retrieval.
  • Replaces saturated knowledge benchmarks at the frontier.

Limitations

  • Still multiple choice, so a correct answer can come from elimination or guessing rather than reasoning.
  • Small total item count limits statistical power.
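To make the statistical-power point concrete, here is a rough sketch of the sampling uncertainty on a 198-item score. It assumes each question is an independent trial with the same success probability and uses a normal approximation; both are simplifications, and the function names are hypothetical.

```python
import math
from typing import Tuple


def accuracy_std_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def normal_interval(accuracy: float, n_items: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval (z = 1.96), clipped to the range [0, 1]."""
    half_width = z * accuracy_std_error(accuracy, n_items)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Example: a 59.4% score on the 198-item Diamond subset.
score, n_items = 0.594, 198
low, high = normal_interval(score, n_items)
print(f"std error = {accuracy_std_error(score, n_items):.3f}")
print(f"approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval is about ±7 percentage points wide, so gaps of a few points between models on this subset may not be meaningful.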
