AI benchmark · Fluid reasoning

Abstraction and Reasoning Corpus for AGI

Short name: ARC-AGI · Introduced 2019 · François Chollet

Visual puzzle benchmark designed to resist memorization and test fluid abstraction.

What it measures

Generalization to novel tasks from a handful of examples, following Chollet's operational definition of intelligence as skill-acquisition efficiency.

Format

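Grid-based visual reasoning tasks (input → output transformations). Grids are at most 30×30, and each cell holds one of ten colors. Each task typically provides 2–5 training pairs and at least one held-out test input whose output grid the solver must produce. Hundreds of tasks, each unique.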
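To make the format concrete, here is a minimal sketch of how a task can be represented and checked. It assumes the JSON layout of the public ARC task files (a "train" list and a "test" list of input/output grids, with cells as integers 0–9). The toy task itself and the helper names `is_valid_grid` and `load_task` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical toy task (mirror each grid left-to-right), written in the
# train/test layout assumed for the public ARC task files.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}
  ]
}
"""


def is_valid_grid(grid):
    """Return True if the grid is rectangular, at most 30x30, with cells 0-9."""
    if not grid or not grid[0] or len(grid) > 30 or len(grid[0]) > 30:
        return False
    width = len(grid[0])
    return all(
        len(row) == width
        and all(isinstance(cell, int) and 0 <= cell <= 9 for cell in row)
        for row in grid
    )


def load_task(text):
    """Parse a task and validate every grid it contains."""
    task = json.loads(text)
    for split in ("train", "test"):
        for pair in task[split]:
            if not is_valid_grid(pair["input"]):
                raise ValueError(f"malformed input grid in {split}")
            # Test outputs may be withheld, so only check them when present.
            if "output" in pair and not is_valid_grid(pair["output"]):
                raise ValueError(f"malformed output grid in {split}")
    return task


task = load_task(TASK_JSON)
print(len(task["train"]), "training pairs,", len(task["test"]), "test input(s)")
```

A solver sees the training pairs and the test input only; the test output is what gets scored.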

Scoring

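Exact-match accuracy on test grids: a prediction counts only if it matches the target grid in dimensions and in every cell, with no partial credit. Human performance ~80–85%; pure LLMs historically <10%.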
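As a rough illustration of exact-match scoring, the sketch below counts a prediction as correct only when the whole grid equals the target. The function names `exact_match` and `accuracy` are hypothetical, and the example ignores details such as how many attempts per test input an official evaluation allows.

```python
def exact_match(predicted, expected):
    """True only if both grids have the same shape and identical cells."""
    # Equality on nested lists compares row count, row lengths, and every cell.
    return predicted == expected


def accuracy(predictions, targets):
    """Fraction of test grids reproduced exactly."""
    if len(predictions) != len(targets):
        raise ValueError("need one prediction per target grid")
    solved = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)


targets = [
    [[0, 1], [0, 0]],
    [[5, 5], [5, 5]],
]
predictions = [
    [[0, 1], [0, 0]],  # identical to the target: counts as solved
    [[5, 5], [5, 0]],  # a single wrong cell: counts as unsolved
]

print(accuracy(predictions, targets))
```

Because one wrong cell voids the whole grid, a nearly right answer scores the same as a blank one.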

Notable results

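December 2024 (reported by ARC Prize): OpenAI's o3 scored 75.7% in a lower-compute setting and 87.5% in a high-compute setting on the semi-private evaluation set, making it the first system to approach human level, at extreme cost.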
Standard frontier LLMs without scaffolding remain in the 20–40% range.

Strengths

Designed to resist contamination: every task is novel, and official evaluation uses held-out tasks that are not published.
Forces compositional, program-like reasoning.
Targets reasoning that is meaningful to humans: tasks are easy for most people yet hard for machines.

Limitations

Narrow modality (2D grids).
Solutions can be brute-forced with sufficient inference compute.
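The brute-force concern can be illustrated with a toy program search. This is a minimal sketch under strong assumptions: the three-primitive grid language (`flip_h`, `flip_v`, `transpose`) and the `search` helper are hypothetical, and real ARC tasks need far richer primitives than this. It does not reproduce any published solver; it only shows how enumerating compositions of primitives and keeping the first one that fits every training pair trades compute for insight.

```python
from itertools import product


def flip_h(grid):
    """Mirror the grid left-to-right."""
    return [row[::-1] for row in grid]


def flip_v(grid):
    """Mirror the grid top-to-bottom."""
    return [row[:] for row in grid[::-1]]


def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical toy DSL: the only operations the search may compose.
PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}


def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(train_pairs, max_depth=3):
    """Return the first program consistent with all training pairs, else None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, inp) == out for inp, out in train_pairs):
                return program
    return None


# Training pairs for a 90-degree clockwise rotation, which no single
# primitive expresses; the search has to compose two of them.
train_pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[0, 5], [0, 0]], [[0, 0], [0, 5]]),
]

program = search(train_pairs)
print(program, run(program, [[1, 0], [0, 0]]))
```

The number of candidate programs grows exponentially with depth, which is why this style of search becomes expensive as tasks demand longer compositions.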

Other AI benchmarks

MMLU

57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.

GPQA

PhD-written hard-science questions designed so non-experts can't solve them even with web search.

HumanEval

164 hand-written Python programming problems evaluated by unit tests.

SWE-Bench

Real GitHub issues from popular Python repos: does the model produce a patch that passes the project's tests?
