

AGI evaluation challenges

Why measuring AGI is fundamentally harder than measuring task-specific AI, and which credible proposals exist in 2026.

Key takeaways

No agreed AGI definition means no agreed AGI test.
Generalisation to genuinely novel problems is the central evaluation challenge.
Multi-task, long-horizon, and embodied evaluations are the most promising directions.

Why this is hard

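AGI is usually defined by what a system can do, not by how it does it, so it can only be assessed through its behaviour on tasks. But any fixed benchmark becomes a target the system can be optimised for, which breaks the connection between the score and the underlying generality the score was meant to indicate.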
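The gap between score and generality can be seen in a toy sketch. Everything here is hypothetical and chosen only for illustration: the "tasks" are integer additions, `memoriser` stands in for a system tuned to the public benchmark, and `general_solver` stands in for one that has actually learned the skill. It is not a model of any real benchmark or system.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[int, int]  # hypothetical task: add two integers


def accuracy(solver: Callable[[int, int], int], tasks: List[Task]) -> float:
    """Fraction of tasks for which the solver returns the correct sum."""
    correct = sum(1 for a, b in tasks if solver(a, b) == a + b)
    return correct / len(tasks)


# A published benchmark and a set of tasks the system has never seen.
public_tasks: List[Task] = [(1, 2), (3, 4), (5, 6), (7, 8)]
novel_tasks: List[Task] = [(10, 20), (30, 40), (50, 60), (70, 80)]

# A system optimised for the benchmark: it has stored the public answers.
answer_key: Dict[Task, int] = {(a, b): a + b for a, b in public_tasks}


def memoriser(a: int, b: int) -> int:
    # Falls back to 0 for anything outside the public benchmark.
    return answer_key.get((a, b), 0)


def general_solver(a: int, b: int) -> int:
    return a + b


solvers = {"memoriser": memoriser, "general": general_solver}
results = {
    name: (accuracy(s, public_tasks), accuracy(s, novel_tasks))
    for name, s in solvers.items()
}

for name, (public_acc, novel_acc) in results.items():
    print(f"{name}: public={public_acc:.2f} novel={novel_acc:.2f}")
```

Both solvers are indistinguishable on the public benchmark. Only the novel tasks separate them, which is why the public score alone says little about generality.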

Credible 2026 directions

The most credible directions in 2026 are benchmarks that emphasise novel task distributions (ARC-AGI), long-horizon agentic work (SWE-Bench Verified, GAIA), tool use under uncertainty, and held-out professional tasks released only at evaluation time.
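One way an evaluator might make "released only at evaluation time" checkable by outsiders is a commit-and-reveal scheme: publish a hash of the sealed task set in advance, then reveal the tasks at evaluation time so anyone can confirm they were not changed afterwards. The sketch below is an assumption made for illustration, not a description of how any benchmark named above operates; `commit`, `verify`, the salt, and the example tasks are all hypothetical.

```python
import hashlib
import hmac
import json
from typing import List


def commit(tasks: List[str], salt: str) -> str:
    """Return a SHA-256 digest of the sealed task set.

    The salt keeps short or guessable task lists from being
    brute-forced out of the published digest.
    """
    payload = json.dumps({"salt": salt, "tasks": tasks}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify(tasks: List[str], salt: str, published_digest: str) -> bool:
    """Check that the revealed tasks match the digest published earlier."""
    return hmac.compare_digest(commit(tasks, salt), published_digest)


# Before the evaluation: only the digest is made public.
sealed_tasks = ["hypothetical task A", "hypothetical task B"]
salt = "hypothetical-random-salt"
published_digest = commit(sealed_tasks, salt)

# At evaluation time: tasks and salt are revealed and anyone can check them.
print(verify(sealed_tasks, salt, published_digest))
print(verify(sealed_tasks + ["task added later"], salt, published_digest))
```

The digest proves only that the task set was fixed before the evaluation. It does not show that the tasks are novel or that they were kept out of training data.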

Sources & further reading

Chollet (2019), On the Measure of Intelligence
