
AGI evaluation challenges
Why measuring AGI is fundamentally harder than measuring task-specific AI, and which credible proposals exist in 2026.
Key takeaways
- No agreed AGI definition means no agreed AGI test.
- Generalisation to genuinely novel problems is the central evaluation challenge.
- Multi-task, long-horizon, and embodied evaluations are the most promising directions.
Why this is hard
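AGI is usually defined by what a system can do, not by how it does it, so any test can only sample its behaviour. But any fixed benchmark you write becomes a target that systems can be optimised for (an instance of Goodhart's law), which breaks the connection between the score and the underlying generality.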
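The break between score and generality can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than an established metric or a real benchmark: the Task type, the addition tasks, the two toy solvers, and the name generalisation_gap are all hypothetical. The sketch compares a solver that has merely memorised the public items with one that actually implements the skill, and measures each on public and held-out items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: str


def accuracy(solver, tasks):
    """Fraction of tasks the solver answers exactly right."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(solver(t.prompt) == t.answer for t in tasks) / len(tasks)


def make_memoriser(seen_tasks):
    """Toy system optimised for the benchmark: a lookup table of seen answers."""
    table = {t.prompt: t.answer for t in seen_tasks}
    return lambda prompt: table.get(prompt, "")


def adder(prompt):
    """Toy system that has the underlying skill (integer addition)."""
    left, right = prompt.split("+")
    return str(int(left) + int(right))


def generalisation_gap(solver, public_tasks, held_out_tasks):
    """Public accuracy minus held-out accuracy (hypothetical illustrative measure)."""
    return accuracy(solver, public_tasks) - accuracy(solver, held_out_tasks)


# Public items are small sums; held-out items come from an unseen range.
public = [Task(f"{a}+{b}", str(a + b)) for a in range(3) for b in range(3)]
held_out = [Task(f"{a}+{b}", str(a + b)) for a in range(10, 13) for b in range(10, 13)]

memoriser = make_memoriser(public)
memoriser_gap = generalisation_gap(memoriser, public, held_out)
adder_gap = generalisation_gap(adder, public, held_out)
```

Both solvers score perfectly on the public items, so the public score alone cannot tell them apart. Only the held-out items separate the memoriser from the solver that has the skill, which is the motivation for benchmarks whose tasks are withheld until evaluation time.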
Credible 2026 directions
The most credible directions are benchmarks that emphasise novel task distributions (ARC-AGI), long-horizon agentic work (SWE-bench Verified, GAIA), tool use under uncertainty, and held-out professional tasks released only at evaluation time.
