AI benchmark · Software engineering
SWE-Bench
Short name: SWE-Bench · Introduced 2023 · Jimenez, Yang et al. (Princeton)
Real GitHub issues from popular Python repos: can the model produce a patch that passes the project's tests?
What it measures
End-to-end engineering competence: navigating a real codebase, localizing a bug, writing a patch, and passing maintainer tests.
Format
2,294 issue-PR pairs from 12 Python repos. The model receives the repository and the issue text and must output a patch. The 'Verified' subset (500 items) is human-validated.
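The input/output contract described above can be sketched as two small records. This is a minimal illustration, not the dataset schema: the class names, field names, and the `looks_like_unified_diff` helper are all hypothetical, and the check is only a cheap format test, not a substitute for actually applying the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One issue-PR pair (field names are illustrative assumptions)."""
    instance_id: str
    repo: str          # e.g. "owner/project"
    base_commit: str   # commit the repo is checked out at before patching
    issue_text: str    # the GitHub issue text shown to the model


@dataclass(frozen=True)
class Prediction:
    """What the model hands back: a patch (unified diff) for one instance."""
    instance_id: str
    patch: str


def looks_like_unified_diff(patch: str) -> bool:
    """Cheap sanity check: file headers and at least one hunk marker exist."""
    lines = patch.splitlines()
    has_old = any(line.startswith("--- ") for line in lines)
    has_new = any(line.startswith("+++ ") for line in lines)
    has_hunk = any(line.startswith("@@") for line in lines)
    return has_old and has_new and has_hunk


# Hypothetical usage with made-up values.
task = TaskInstance("demo-1", "owner/project", "abc123", "Function f crashes on empty input")
pred = Prediction(task.instance_id, "--- a/x.py\n+++ b/x.py\n@@ -1 +1 @@\n-a\n+b\n")
```

Keeping the prediction keyed by `instance_id` lets a harness pair each patch with the repository state it should be applied to.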
Scoring
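Percentage of issues resolved. An issue counts as resolved when the patch applies and all associated tests pass, including the tests that failed before the fix.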
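As a rough illustration of the metric, the sketch below computes a resolve rate from per-instance outcomes. The `EvalOutcome` record and its fields are assumptions made for this example, not the output format of any official harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalOutcome:
    """Result of evaluating one predicted patch (hypothetical record)."""
    instance_id: str
    patch_applied: bool
    tests_passed: int
    tests_total: int


def is_resolved(outcome: EvalOutcome) -> bool:
    """Resolved means the patch applied and every associated test passed."""
    return (
        outcome.patch_applied
        and outcome.tests_total > 0
        and outcome.tests_passed == outcome.tests_total
    )


def resolve_rate(outcomes: Sequence[EvalOutcome]) -> float:
    """Percentage of instances resolved; 0.0 for an empty run."""
    if not outcomes:
        return 0.0
    resolved = sum(1 for o in outcomes if is_resolved(o))
    return 100.0 * resolved / len(outcomes)


# Made-up outcomes: only the first instance is fully resolved.
outcomes = [
    EvalOutcome("demo-1", True, 5, 5),
    EvalOutcome("demo-2", True, 4, 5),   # one test still failing
    EvalOutcome("demo-3", False, 0, 5),  # patch did not apply
    EvalOutcome("demo-4", True, 0, 3),
]
rate = resolve_rate(outcomes)
```

Note that the metric is all-or-nothing per instance: a patch that passes four of five tests contributes nothing to the score.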
Notable results
- Devin (2024 demo): 13.86%, reported on a subset of the benchmark.
- Claude 3.5 Sonnet + agent scaffolding: ~49% on Verified.
- Claude 3.7 / Claude 4 agents and OpenAI o-series models: 60–70%+ on Verified.
Strengths
- Realistic, agentic, requires multi-file reasoning.
- Harder to memorize than standalone problems because solutions depend on context spread across a whole repository.
Limitations
- Sensitive to scaffolding/agent harness choices.
- Python-only; the distribution of issues across the 12 projects is skewed.
Related entities
Other AI benchmarks
MMLU
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
ARC-AGI
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
GPQA
PhD-written hard-science questions designed so non-experts can't solve them even with web search.
HumanEval
164 hand-written Python programming problems evaluated by unit tests.
