AI benchmark · Code generation
HumanEval
Short name: HumanEval · Introduced 2021 · Chen et al. (OpenAI)
164 hand-written Python programming problems evaluated by unit tests.
What it measures
Functional correctness of generated code: does the model produce a program that passes unit tests withheld from the prompt?
Format
Each prompt is a function signature plus docstring; the model completes the function body. Completions are scored with pass@k (the probability that at least one of k samples passes the unit tests).
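The prompt-then-test flow can be sketched roughly as follows. This is a minimal illustration, not the official harness: the toy problem is hypothetical (not taken from the dataset), the helper name passes is an assumption, and the dictionary keys prompt, test, and entry_point are modeled on the fields of the published dataset. Real evaluations run generated code in a sandbox with timeouts, which this sketch omits.

```python
# Hypothetical toy problem in the HumanEval style (NOT from the dataset).
problem = {
    # What the model sees: a signature plus a docstring.
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    # What the grader runs: unit tests that are not part of the prompt.
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes the problem's unit tests."""
    program = problem["prompt"] + completion + "\n\n" + problem["test"]
    namespace: dict = {}
    try:
        # WARNING: exec on untrusted model output is unsafe outside a sandbox.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes(problem, good_completion))
print(passes(problem, bad_completion))
```

A completion counts as correct only if every assertion in the test passes; there is no partial credit.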
Scoring
Reported as pass@1, pass@10, and pass@100. The original Codex model (12B parameters, 2021) scored ~28.8% pass@1.
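In practice pass@k is usually estimated rather than measured by drawing exactly k samples: generate n >= k samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k), which is the unbiased estimator described by Chen et al. A minimal sketch, where the function names pass_at_k and benchmark_pass_at_k are illustrative rather than taken from any particular library:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: 10 samples for one problem, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass. The benchmark score is the mean of the per-problem estimates over all 164 problems.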
Notable results
- GPT-4: ~67% pass@1.
- Claude 3.5 Sonnet: ~92% pass@1.
- Frontier models now nearly saturate the benchmark; the community has shifted to harder benchmarks such as SWE-Bench.
Strengths
- Objective unit-test grading.
- Reproducible, with a single target language (Python).
Limitations
- Toy-sized problems unrepresentative of real software.
- Heavy contamination in modern training data.
Related entities
Other AI benchmarks
MMLU
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
ARC-AGI
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
GPQA
PhD-written hard-science questions designed so non-experts can't solve them even with web search.
SWE-Bench
Real GitHub issues from popular Python repos: does the model produce a patch that passes the project's tests?
