AI benchmark · Code generation

HumanEval

Short name: HumanEval · Introduced 2021 · Chen et al. (OpenAI)

164 hand-written Python programming problems evaluated by unit tests.

What it measures

Functional correctness of generated code: does the model produce a program that passes unit tests it is not shown in the prompt?

Each problem supplies a function signature and docstring, and the model completes the function body. All 164 problems are scored with pass@k: the probability that at least one of k sampled completions passes the problem's unit tests.
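To make the task format concrete, here is a minimal sketch of how one problem might be checked. The problem, the completion, and the `passes` helper are hypothetical illustrations, not items from the dataset or the official harness; the layout (a prompt, a `check(candidate)` test function, and an entry-point name) follows the structure of the published problems. A real harness would also sandbox the execution and apply a timeout, which this sketch omits.

```python
# Hypothetical problem in the HumanEval style (not taken from the dataset).
PROMPT = '''def running_max(xs):
    """Return a list where element i is the maximum of xs[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate function body, standing in for model output.
COMPLETION = '''    out, best = [], None
    for x in xs:
        best = x if best is None or x > best else best
        out.append(best)
    return out
'''

# Unit tests the model never sees in its prompt.
TEST = '''def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt, completion, test, entry_point):
    """Return True if prompt + completion passes every assertion in test.

    Assumption: no sandboxing or timeout, so only run trusted code here.
    """
    namespace = {}
    try:
        # Define the completed function and the check() helper.
        exec(prompt + completion + "\n" + test, namespace)
        # Run the tests against the completed function.
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False
```

Calling `passes(PROMPT, COMPLETION, TEST, "running_max")` reports whether this one sample counts as a pass; repeating that over many samples per problem yields the counts that pass@k is computed from.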

Results are usually reported as pass@1, pass@10, and pass@100. The original Codex model (2021) scored about 28.8% pass@1.
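Computing pass@k by drawing exactly k samples gives a noisy estimate, so Chen et al. describe an unbiased estimator: draw n ≥ k samples per problem, count the c samples that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below implements that formula for a single problem; the function name and argument checks are choices made here for illustration, and the benchmark score is the mean of this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the unit tests
    k: the k in pass@k (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimator reduces to the plain pass rate c / n, which is why pass@1 is often read as "fraction of single attempts that succeed".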

GPT-4: ~67% pass@1.
Claude 3.5 Sonnet: ~92% pass@1.
Frontier models now come close to saturating the benchmark, and the community has shifted to harder benchmarks such as SWE-Bench.

MMLU

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

ARC-AGI

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.

GPQA

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

SWE-Bench

Real GitHub issues from popular Python repositories: does the model produce a patch that passes the project's tests?
