AI benchmark · Software engineering

SWE-Bench

Short name: SWE-Bench · Introduced 2023 · Jimenez, Yang et al. (Princeton)

Real GitHub issues from popular Python repos: does the model produce a patch that passes the project's tests?

What it measures

End-to-end engineering competence: navigating a real codebase, localizing a bug, writing a patch, and passing maintainer tests.

Format

2,294 issue-PR pairs from 12 Python repos. The model receives the repo and the issue text and must output a patch. The 'Verified' subset (500 items) is human-validated.
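To make the input/output contract concrete, here is a minimal sketch of one task and a rough check that a model's answer has the shape of a patch. The names `TaskInstance`, `build_prompt`, and `looks_like_unified_diff` are hypothetical and are not part of the benchmark's tooling. In practice agents usually get a checkout of the repo, not just a prompt string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One issue-PR pair (hypothetical field names, for illustration only)."""
    repo: str        # e.g. "owner/project"
    issue_text: str  # the GitHub issue the model must resolve


def build_prompt(task: TaskInstance) -> str:
    """Assemble a plain-text prompt; real harnesses also expose the repo files."""
    return (
        f"Repository: {task.repo}\n"
        f"Issue:\n{task.issue_text}\n\n"
        "Respond with a unified diff that resolves the issue."
    )


def looks_like_unified_diff(text: str) -> bool:
    """Cheap structural check: file headers plus at least one hunk header."""
    lines = text.splitlines()
    has_old = any(line.startswith("--- ") for line in lines)
    has_new = any(line.startswith("+++ ") for line in lines)
    has_hunk = any(line.startswith("@@ ") for line in lines)
    return has_old and has_new and has_hunk


example_task = TaskInstance(repo="owner/project", issue_text="Crash on empty input")
example_patch = (
    "--- a/pkg/core.py\n"
    "+++ b/pkg/core.py\n"
    "@@ -1,2 +1,4 @@\n"
    " def run(data):\n"
    "+    if not data:\n"
    "+        return []\n"
    "     return [x for x in data]\n"
)
```

The structural check only filters out answers that are clearly not patches; whether a patch is correct is decided by running the project's tests.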

Scoring

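% of issues resolved. An issue counts as resolved when the patch applies and the instance's designated tests all pass: the tests the reference fix turns from failing to passing, plus previously passing tests that must not break.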
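The metric can be sketched in a few lines. This is an illustrative calculation, not the official evaluation harness; `InstanceResult`, `is_resolved`, and `resolved_rate` are hypothetical names, and treating an instance with no recorded test results as unresolved is an assumption made here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceResult:
    """Outcome of evaluating one model patch (hypothetical structure)."""
    instance_id: str
    patch_applied: bool
    # Maps test name -> whether it passed after applying the patch.
    test_results: Dict[str, bool] = field(default_factory=dict)


def is_resolved(result: InstanceResult) -> bool:
    """Resolved only if the patch applied and every recorded test passed."""
    if not result.patch_applied or not result.test_results:
        return False  # assumption: no test evidence means not resolved
    return all(result.test_results.values())


def resolved_rate(results: List[InstanceResult]) -> float:
    """Percentage of instances resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


sample_run = [
    InstanceResult("a-1", True, {"test_fix": True, "test_old": True}),
    InstanceResult("a-2", True, {"test_fix": True, "test_old": False}),
    InstanceResult("a-3", False, {}),
    InstanceResult("a-4", True, {"test_fix": False, "test_old": True}),
]
```

In `sample_run`, only `a-1` counts: `a-2` breaks an existing test, `a-3` never applies, and `a-4` does not fix the issue.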

Notable results

Devin (2024 demo): 13.86%, reported on a random subset of the full benchmark.
Claude 3.5 Sonnet + agent scaffolding: ~49% on Verified.
Claude 3.7 / 4 agents and o-series: 60–70%+ on Verified.

Strengths

Realistic, agentic, requires multi-file reasoning.
Hard to memorize because solutions span entire repos.

Limitations

Sensitive to scaffolding/agent harness choices.
Python-only; project distribution skewed toward a few repositories, with Django contributing a large share of the issues.

Other AI benchmarks

MMLU

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

ARC-AGI

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.

GPQA

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

HumanEval

164 hand-written Python programming problems evaluated by unit tests.