AI benchmark · Fluid reasoning
Abstraction and Reasoning Corpus for AGI
Short name: ARC-AGI · Introduced 2019 · François Chollet
Visual puzzle benchmark designed to resist memorization and test fluid abstraction.
What it measures
Generalization to novel tasks from a handful of examples, following Chollet's operational definition of intelligence as skill-acquisition efficiency.
Format
Grid-based visual reasoning tasks (input → output transformations). Each task provides 2–5 training pairs and a held-out test grid. Hundreds of tasks, each unique.
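As a rough illustration of the format, the sketch below loads one task and checks a candidate rule against its training pairs. It assumes the JSON layout used by the public ARC task files (a "train" list and a "test" list of input/output grids, each grid a list of rows of small integers). The task itself, `load_task`, and `mirror` are hypothetical and not part of the corpus.

```python
import json

# Hypothetical toy task (not from the real corpus): mirror each grid left-to-right.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 3, 0]], "output": [[0, 3, 3]]}
  ],
  "test": [
    {"input": [[0, 5], [4, 0]], "output": [[5, 0], [0, 4]]}
  ]
}
"""


def load_task(text):
    """Parse one task and return its training pairs and test pairs."""
    task = json.loads(text)
    return task["train"], task["test"]


def mirror(grid):
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


train_pairs, test_pairs = load_task(TASK_JSON)

# A candidate rule is only plausible if it reproduces every training output.
rule_fits_training = all(mirror(p["input"]) == p["output"] for p in train_pairs)

# Apply the rule to the held-out test input to produce a prediction.
prediction = mirror(test_pairs[0]["input"])
```

A solver sees only the training pairs and the test input; the test output is used for scoring alone.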
Scoring
Exact-match accuracy on test grids. Human performance ~80–85%; pure LLMs historically <10%.
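Exact-match scoring can be sketched as follows. This is a minimal version that assumes one attempt per test grid; the function names are illustrative and not taken from any official harness.

```python
def exact_match(predicted, expected):
    """A grid counts as solved only if every cell and both dimensions agree."""
    return predicted == expected


def accuracy(predictions, targets):
    """Fraction of test grids reproduced exactly; no partial credit."""
    if not targets:
        return 0.0
    solved = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)


# Hypothetical predictions: the second grid is wrong in a single cell.
targets = [[[1, 2], [3, 4]], [[0, 0], [0, 9]]]
predictions = [[[1, 2], [3, 4]], [[0, 0], [9, 9]]]
result = accuracy(predictions, targets)
```

Because a single wrong cell scores the same as a blank answer, near misses do not raise the score.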
Notable results
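- December 2024: OpenAI's o3, tested with ARC Prize on the semi-private evaluation set, scored 75.7% at the lower-compute setting and 87.5% at high compute - the first system to approach human level, at extreme cost.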
- Standard frontier LLMs without scaffolding remain in the 20–40% range.
Strengths
- Designed to resist contamination: tasks are novel, and final evaluation uses a held-out test set.
- Forces compositional, program-like reasoning.
- Tasks that most people find easy remain hard for AI systems, so scores track a human-meaningful notion of intelligence.
Limitations
- Narrow modality (2D grids).
- Solutions can be brute-forced with sufficient inference compute.
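To make the brute-force limitation concrete, here is a toy sketch of program search: it enumerates compositions of a few grid primitives and returns the first one that reproduces every training pair. The three primitives and the example task are assumptions chosen for illustration. Real solvers use far larger primitive sets and smarter search, and this is not any particular system's method.

```python
from itertools import product


def flip_h(grid):
    """Mirror left-to-right."""
    return [row[::-1] for row in grid]


def flip_v(grid):
    """Mirror top-to-bottom."""
    return [row[:] for row in grid[::-1]]


def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical miniature DSL of grid operations.
PRIMITIVES = {"flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}


def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(train_pairs, max_depth=3):
    """Try every composition up to max_depth; cost grows exponentially with depth."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, inp) == out for inp, out in train_pairs):
                return program
    return None


# Toy task: rotate each grid 90 degrees clockwise.
train_pairs = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 0, 0]], [[5], [0], [0]]),
]
program = search(train_pairs)
held_out = run(program, [[7, 8, 9]])
```

No single primitive fits the task, so the search has to compose two of them. With k primitives and depth d it examines up to k^d candidates per depth, which is why this style of solution trades compute for accuracy.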
Related entities
Other AI benchmarks
MMLU
57-subject multiple-choice exam spanning STEM, humanities, social sciences, and professional knowledge.
GPQA
PhD-written hard-science questions designed so non-experts can't solve them even with web search.
HumanEval
164 hand-written Python programming problems evaluated by unit tests.
SWE-Bench
Real GitHub issues from popular Python repos - does the model produce a patch that passes the project's tests?
