AI benchmark · Code generation

HumanEval

Short name: HumanEval · Introduced 2021 · Chen et al. (OpenAI)

164 hand-written Python programming problems evaluated by unit tests.

What it measures

Functional correctness of generated code: does the model produce a program that passes the problem's unit tests? The tests are not shown to the model in the prompt.

Format

Each problem gives a Python function signature and docstring; the model completes the body. Results over the 164 problems are reported as pass@k: the probability that at least one of k sampled completions passes all of a problem's unit tests.
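To make the format concrete, here is a minimal sketch of how one problem could be checked. The problem itself (`add`), the `passes` helper, and the variable names are illustrative assumptions, not taken from the dataset or from the official harness. A real evaluation would also run untrusted completions in a sandbox with a timeout, which this sketch omits.

```python
# Hypothetical problem in the HumanEval style (NOT from the dataset).
# The prompt is what the model sees: signature plus docstring.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b.
    >>> add(2, 3)
    5
    """
'''

# What a model might return: only the function body.
COMPLETION = "    return a + b\n"

# Unit tests that decide pass or fail; the model does not see these.
TEST = '''def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests.

    Illustrative helper only: it uses exec() with no sandboxing or timeout.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)     # define the candidate function
        exec(test, namespace)                    # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a fail.
        return False


if __name__ == "__main__":
    print(passes(PROMPT, COMPLETION, TEST, "add"))
```

A completion either passes every assertion or counts as a failure; there is no partial credit.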

Scoring

Reported as pass@1, pass@10, and pass@100. The original Codex model (12B parameters, 2021) scored about 28.8% pass@1.
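pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and computing 1 − C(n−c, k) / C(n, k), the unbiased estimator described by Chen et al. The sketch below assumes that formulation; the function names and the per-problem `(n, c)` input layout are illustrative choices, not part of any official tooling.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples drawn, c: samples that passed, k: budget being scored.
    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no passing sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 10 samples each.
    demo = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(demo, k=1))
    print(mean_pass_at_k(demo, k=5))
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples, which is why pass@1 can be read as a per-sample success rate.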

Notable results

  • GPT-4: ~67% pass@1.
  • Claude 3.5 Sonnet: ~92% pass@1.
  • Frontier models now come close to saturating the benchmark, and the community has shifted to harder ones such as SWE-bench.

Strengths

  • Objective unit-test grading.
  • Reproducible, with a single target language (Python) that keeps scores comparable across models.

Limitations

  • Small, self-contained problems that are not representative of real-world software engineering.
  • Heavy contamination: the problems and their solutions appear widely in modern training data.
