AI benchmark · Software engineering

SWE-Bench

Short name: SWE-Bench · Introduced 2023 · Jimenez, Yang et al. (Princeton)

Real GitHub issues from popular Python repositories: can the model produce a patch that passes the project's tests?

What it measures

End-to-end engineering competence: navigating a real codebase, localizing a bug, writing a patch, and passing maintainer tests.

Format

2,294 issue-PR pairs from 12 Python repositories. The model receives the repository and the issue text and must output a patch. The 'Verified' subset (500 items) is human-validated.
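As a rough illustration of the format, one task can be thought of as a record holding the repository, the issue text, and the tests used for grading. The sketch below is a minimal model of that idea, not the benchmark's actual schema: the class `TaskInstance`, its field names, and the helper `build_prompt` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskInstance:
    """Hypothetical container for one issue-PR pair (illustrative field names)."""
    repo: str          # repository identifier, e.g. "owner/project"
    base_commit: str   # repository state the model starts from
    issue_text: str    # the GitHub issue text shown to the model
    # Tests used for grading; the model never sees these.
    fail_to_pass: List[str] = field(default_factory=list)
    pass_to_pass: List[str] = field(default_factory=list)


def build_prompt(task: TaskInstance) -> str:
    """Assemble the model input; the expected answer is a patch (a diff)."""
    return (
        f"Repository: {task.repo} at commit {task.base_commit}\n"
        f"Issue:\n{task.issue_text}\n"
        "Respond with a unified diff that resolves the issue."
    )
```

The grading tests are stored on the record but deliberately left out of the prompt, mirroring the setup described above: the model sees only the repository and the issue.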

Scoring

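Percentage of issues resolved. An issue counts as resolved when the patch applies, the tests that the reference fix is meant to repair (fail-to-pass) now pass, and the tests that already passed (pass-to-pass) still pass.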
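The scoring rule can be sketched in a few lines. This is a simplified illustration rather than the official evaluation harness: the function names, the `"passed"` status string, and the split of required tests into `fail_to_pass` and `pass_to_pass` lists are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def is_resolved(
    patch_applied: bool,
    test_results: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """An issue is resolved only if the patch applies and every required test passes."""
    if not patch_applied:
        return False
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results is treated as not passed.
    return all(test_results.get(name) == "passed" for name in required)


def resolved_rate(outcomes: List[bool]) -> float:
    """Percentage of issues resolved across a benchmark run."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)
```

Under this rule there is no partial credit: a patch that fixes the target test but breaks one previously passing test scores the same as no patch at all.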

Notable results

  • Devin (2024 demo): 13.86%, reported on a random sample of the benchmark.
  • Claude 3.5 Sonnet + agent scaffolding: ~49% on Verified.
  • Claude 3.7 / 4 agents and o-series: 60–70%+ on Verified.

Strengths

  • Realistic, agentic, requires multi-file reasoning.
  • Harder to solve by memorization than single-function benchmarks, because each fix must fit into an entire repository.

Limitations

  • Sensitive to scaffolding/agent harness choices.
  • Python-only; the task distribution is skewed toward a few large projects.
