Comparison · 2026 edition

The Intelligence Index

How do human brains, chimpanzees, frontier LLMs, reasoning models, autonomous agents, and a hypothetical AGI compare across the dimensions of cognition? The capability matrix below scores four of the six profiled systems.

Capability matrix

0 = none · 5 = peak

Capability	Description	Human Brain	Frontier LLM	Reasoning Model	Hypothetical AGI
Language	Comprehension, generation, and translation of natural language.	5/5	5/5	5/5	5/5
Reasoning	Multi-step inference, logic, and problem decomposition.	4/5 (slow but flexible)	3/5 (limited by single-pass inference)	4/5 (chain-of-thought search)	5/5
Mathematics	Symbolic manipulation, proof, and quantitative reasoning.	3/5	3/5	4/5	5/5
Long-term Memory	Durable, retrievable knowledge across time.	4/5	2/5 (context-window bound)	2/5	5/5
Continual Learning	Improving from new experience without catastrophic forgetting.	5/5	1/5 (frozen weights at inference)	1/5	5/5
Multimodal Perception	Integrating vision, audio, and other modalities.	5/5	4/5	3/5	5/5
Embodiment	Acting in the physical world with a body and sensors.	5/5	0/5	0/5	4/5
Social Cognition	Theory of mind, intent modeling, and cooperation.	5/5	3/5	3/5	5/5
Creativity	Novel composition across art, science, and engineering.	5/5	4/5	3/5	5/5
Self-Model	Awareness of one's own state, knowledge, and limits.	5/5	2/5	3/5	5/5
Subjective Experience	Phenomenal experience - the 'what it is like'.	5/5	0/5	0/5	0/5 (unknown / contested)
Energy Efficiency	Cognition per joule of compute.	5/5 (~20 watts)	1/5 (megawatt-scale training)	1/5	2/5 (architecture-dependent)
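The matrix can also be read programmatically. The following is a minimal sketch, assuming the scores are copied by hand from the table above; the names `SYSTEMS`, `MATRIX`, `compare`, and `largest_gaps` are hypothetical and are not part of the index itself.

```python
from typing import Dict, List, Tuple

# Column order matches the capability matrix (0 = none, 5 = peak).
SYSTEMS: List[str] = ["Human Brain", "Frontier LLM", "Reasoning Model", "Hypothetical AGI"]

# Scores transcribed from the matrix; names and layout are illustrative only.
MATRIX: Dict[str, Tuple[int, int, int, int]] = {
    "Language": (5, 5, 5, 5),
    "Reasoning": (4, 3, 4, 5),
    "Mathematics": (3, 3, 4, 5),
    "Long-term Memory": (4, 2, 2, 5),
    "Continual Learning": (5, 1, 1, 5),
    "Multimodal Perception": (5, 4, 3, 5),
    "Embodiment": (5, 0, 0, 4),
    "Social Cognition": (5, 3, 3, 5),
    "Creativity": (5, 4, 3, 5),
    "Self-Model": (5, 2, 3, 5),
    "Subjective Experience": (5, 0, 0, 0),
    "Energy Efficiency": (5, 1, 1, 2),
}


def compare(a: str, b: str) -> Dict[str, int]:
    """Return score(a) - score(b) for every capability."""
    i, j = SYSTEMS.index(a), SYSTEMS.index(b)
    return {cap: row[i] - row[j] for cap, row in MATRIX.items()}


def largest_gaps(a: str, b: str, n: int = 3) -> List[str]:
    """Capabilities where `a` leads `b` by the widest margin, widest first."""
    gaps = compare(a, b)
    # sorted() is stable, so ties keep the row order of the matrix.
    return sorted(gaps, key=lambda cap: gaps[cap], reverse=True)[:n]


if __name__ == "__main__":
    print(largest_gaps("Human Brain", "Frontier LLM"))
```

With the scores as transcribed, the three widest leads of the human brain over a frontier LLM are Embodiment, Subjective Experience, and Continual Learning. Because the scores are qualitative estimates, the differences indicate direction and rough size only.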

System profiles

Factual reference profiles for every system in the index, with architecture, training scale, headline benchmark results, and known limitations.

Human Brain

Biological

The human brain contains roughly 86 billion neurons and an estimated 100 trillion synapses, organized into a six-layered neocortex, subcortical structures, cerebellum, and brainstem. It runs on about 20 watts of metabolic power - less than a dim lightbulb - yet supports language, abstract reasoning, embodied motor control, and continuous lifelong learning.

Cognition emerges from specialized regions working together: prefrontal cortex for planning and working memory, hippocampus for episodic consolidation, basal ganglia for action selection, and a default mode network active during self-referential thought. Synaptic plasticity lets the brain learn new tasks from a handful of examples - a sample efficiency no current artificial system matches.

Humans remain the only known system with confirmed subjective experience, theory of mind across arbitrary contexts, and the ability to construct and revise their own goals across decades of life.

Neurons: ~86 billion
Synapses: ~100 trillion
Power: ~20 watts
Training data: Embodied lived experience

Chimpanzee

Biological

Chimpanzees (Pan troglodytes) share roughly 98.8% of their DNA with humans and possess brains of about 350 grams containing an estimated 28 billion neurons - roughly a third of the human count, in similar cortical and limbic structures. They use and modify tools in the wild (termite-fishing sticks, hammer-and-anvil nut cracking), pass cultural variants between troops, and recognize themselves in mirrors.

Studies of language-trained apes, including the chimpanzees Ai and Washoe and the bonobo Kanzi (Pan paniscus, a closely related species), demonstrate symbol use and limited comprehension of human grammar, though not the production of recursive syntactic language. On some short-term memory tasks, chimpanzees outperform adult humans. Their social cognition supports coalition politics, deception, and reconciliation.

Chimpanzees set a useful biological baseline for what intelligence looks like without human-scale language and without artificial compute.

Neurons: ~28 billion
Genetic overlap with humans: ~98.8%
Verified behaviors: Tool making, culture, mirror self-recognition
Language: Symbol use, no recursive syntax

Frontier LLM

LLM

Frontier large language models such as OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet and Claude 3 Opus, and Google Gemini 1.5 and 2.0 are transformer networks with estimated parameter counts in the hundreds of billions to low trillions (vendors generally do not publish exact figures), trained on tens of trillions of tokens of text, code, and multimodal data. Training runs cost tens to hundreds of millions of dollars and consume megawatt-scale electricity.

After pretraining they are aligned with supervised fine-tuning and reinforcement learning from human or AI feedback (RLHF / RLAIF). Capabilities include fluent multilingual generation, summarization, translation, code synthesis, and vision-language understanding. They score in the 85–90% range on MMLU, pass bar and medical licensing exams, and solve a majority of HumanEval coding problems.

Their cognition is bounded by a fixed context window (typically 128k–2M tokens), frozen weights at inference, and a tendency to hallucinate plausible but false statements when extrapolating beyond training distribution.

Parameters: ~10¹¹–10¹²
Training tokens: 10–30+ trillion
MMLU score: ~85–90%
Context window: 128k – 2M tokens

Reasoning Model

Reasoning models - OpenAI o1 and o3, DeepSeek-R1, Google Gemini 2.0 Flash Thinking, Anthropic's extended-thinking modes - extend the frontier LLM architecture with reinforcement learning over long chain-of-thought traces. At inference they spend additional compute generating, verifying, and revising internal reasoning steps before producing a final answer.

This test-time compute scaling produces large gains on tasks with verifiable solutions: o3 reaches 87.5% on the ARC-AGI semi-private set in its high-compute configuration, 96.7% on AIME 2024 mathematics, and a Codeforces competitive programming Elo above 2700. Costs per query rise by one to two orders of magnitude compared to a single-pass LLM.

Their weights remain frozen at inference, and they share the multimodal and memory limits of base LLMs. They trade latency for accuracy, which suits math, code, scientific reasoning, and formal verification.
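The idea of spending more inference compute on generating and verifying candidates can be illustrated with a toy loop. This is a sketch under loud assumptions: it is not how o1, o3, or DeepSeek-R1 work internally, the equation is invented for the example, and `propose`, `verify`, `solve`, and `solve_rate` are hypothetical names. It only shows why a verifiable task rewards a larger sampling budget.

```python
import random
from typing import Optional


def verify(x: int) -> bool:
    """Cheap exact check: does x satisfy x^2 + 3x = 70?"""
    return x * x + 3 * x == 70


def propose(rng: random.Random) -> int:
    """Stand-in for a model's sampled answer: a random guess in [0, 20]."""
    return rng.randint(0, 20)


def solve(budget: int, rng: random.Random) -> Optional[int]:
    """Sample up to `budget` candidates and return the first that verifies."""
    for _ in range(budget):
        candidate = propose(rng)
        if verify(candidate):
            return candidate
    return None  # budget exhausted without a verified answer


def solve_rate(budget: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials solved with the given per-query sampling budget."""
    rng = random.Random(seed)
    solved = sum(solve(budget, rng) is not None for _ in range(trials))
    return solved / trials


if __name__ == "__main__":
    for budget in (1, 10, 100):
        print(budget, solve_rate(budget))
```

The solve rate rises with the budget, while the cost per query rises in proportion to the number of samples. That mirrors the accuracy-for-cost trade described above, though only qualitatively.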

Mechanism: RL over chain-of-thought
AIME 2024 (o3): 96.7%
ARC-AGI (o3, semi-private): 87.5%
Inference cost: 10–100× a base LLM

AI Agent

Agent

AI agents wrap a frontier or reasoning model in a control loop that perceives an environment, plans, calls external tools (web browsers, code interpreters, APIs, file systems), and writes to long-term memory. Reference implementations include Anthropic Claude with computer use, OpenAI Operator, Google Project Mariner, Devin, and open-source frameworks such as LangGraph and AutoGPT.
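The control loop described above can be sketched in a few lines. This is a minimal illustration, assuming a scripted stand-in for the model; `toy_policy`, `TOOLS`, and `run_agent` are hypothetical names and do not correspond to the API of any product or framework listed here.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, str]  # (tool name, tool argument)


def toy_policy(goal: str, memory: List[str]) -> Action:
    """Hypothetical stand-in for the model: pick the next tool call."""
    if not memory:
        return ("search", goal)
    if len(memory) == 1:
        return ("summarize", memory[-1])
    return ("finish", memory[-1])


# Hypothetical tools; a real agent would call a browser, interpreter, or API.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"raw notes about {query}",
    "summarize": lambda text: f"summary of ({text})",
}


def run_agent(goal: str,
              policy: Callable[[str, List[str]], Action] = toy_policy,
              max_steps: int = 5) -> str:
    """Plan, act, observe, and remember until the policy finishes or steps run out."""
    memory: List[str] = []
    for _ in range(max_steps):
        tool, arg = policy(goal, memory)   # plan the next step
        if tool == "finish":
            return arg
        observation = TOOLS[tool](arg)     # act and perceive the result
        memory.append(observation)         # write to memory
    return "step budget exhausted"


if __name__ == "__main__":
    print(run_agent("chimpanzee tool use"))
```

Here `run_agent("chimpanzee tool use")` returns `summary of (raw notes about chimpanzee tool use)`. The `max_steps` cap is a design choice in this sketch to stop a loop that never reaches `finish`.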

Persistent memory is typically implemented with vector databases, scratchpads, and structured stores rather than weight updates, so learning is online at the memory layer but the underlying model remains frozen. On the SWE-bench Verified software-engineering benchmark, leading agents resolve 60%+ of real GitHub issues; on WebArena and OSWorld they complete a meaningful but minority share of multi-step computer tasks.
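Memory at the retrieval layer, rather than in the weights, can be sketched as follows. This is a toy under stated assumptions: the word-count `embed` function stands in for a learned embedding model, and `MemoryStore`, `write`, and `recall` are hypothetical names, not the interface of any real vector database.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real agent would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Append-only store: new facts are added without touching any model weights."""

    def __init__(self) -> None:
        self._items: List[Tuple[Counter, str]] = []

    def write(self, text: str) -> None:
        self._items.append((embed(text), text))

    def recall(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


if __name__ == "__main__":
    store = MemoryStore()
    store.write("user prefers tabs over spaces")
    store.write("build fails on python 3.8")
    store.write("deploy target is staging cluster")
    print(store.recall("which python version fails the build"))
```

In this example the query shares three words with the second note and none with the others, so `recall` returns `['build fails on python 3.8']`. What the agent "learns" lives entirely in `_items`, which is the sense in which learning happens at the memory layer.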

Agents extend LLM cognition into action, but inherit hallucination, brittle long-horizon planning, and difficulty recovering from compounding errors in open environments.

Architecture: LLM + tools + memory + control loop
SWE-bench Verified: 60%+ (leading agents)
Embodiment: Browser / OS, not physical
Learning: External memory, frozen weights

Hypothetical AGI

Artificial General Intelligence (AGI) refers to a hypothetical system that matches or exceeds human performance across the full breadth of cognitive tasks - language, reasoning, learning from few examples, transfer to new domains, embodied control, and social cognition - rather than excelling in a narrow band. OpenAI's charter frames AGI as 'highly autonomous systems that outperform humans at most economically valuable work,' while DeepMind's 2023 levels framework defines it as performance at or above the 50th percentile of skilled adults across non-physical tasks.

No system today meets these criteria. Public timelines from researchers at OpenAI, Anthropic, DeepMind, and METR cluster between 2027 and the late 2030s, with substantial disagreement and tail risk in both directions. Required advances likely include continual learning without catastrophic forgetting, robust long-horizon planning, and grounded world models.

Subjective experience in such a system is unresolved and is scored 0 in this index to reflect the absence of accepted evidence, not a claim of impossibility.

Status: Hypothetical - not demonstrated
OpenAI definition: Outperforms humans at most economically valuable work
DeepMind level: ≥50th percentile of skilled adults
Common timeline estimates: 2027 – late 2030s

Capabilities, in depth

The 12 cognitive capabilities the index scores, and where the state of the art stands on each in 2026.

Language: Comprehension, generation, and translation of natural language.
Frontier LLMs match or exceed average human performance on translation, summarization, and reading comprehension benchmarks. Humans retain the edge on pragmatics, long-range discourse coherence beyond a single context window, and grounding language in lived experience.

Reasoning: Multi-step inference, logic, and problem decomposition.
Single-pass LLMs reason competently across 5–10 steps; reasoning models extend this via chain-of-thought RL, scoring 87.5% on ARC-AGI semi-private. Humans still generalize better to truly novel problem structures without task-specific training.

Mathematics: Symbolic manipulation, proof, and quantitative reasoning.
OpenAI o3 reaches 96.7% on AIME 2024 and 25% on FrontierMath, surpassing typical human mathematicians on contest problems. Open research-level proof remains hard for both humans and machines.

Long-term Memory: Durable, retrievable knowledge across time.
Humans consolidate episodic memory across decades via hippocampal-cortical replay. LLMs store knowledge in weights but cannot update them at inference; agents work around this with vector stores and retrieval augmentation.

Continual Learning: Improving from new experience without catastrophic forgetting.
Humans learn new skills from a handful of examples without losing prior knowledge. Neural networks suffer catastrophic forgetting when fine-tuned; this remains an unsolved research problem and a major gap to AGI.

Multimodal Perception: Integrating vision, audio, and other modalities.
GPT-4o, Gemini, and Claude process vision and audio natively at near-human accuracy on standard benchmarks. Humans retain richer somatic, olfactory, and proprioceptive channels grounded in embodiment.

Embodiment: Acting in the physical world with a body and sensors.
Robotics foundation models (RT-2, Figure 02, Tesla Optimus, Physical Intelligence π0) demonstrate increasing dexterity, but reliability on novel real-world manipulation remains far below biological systems.

Social Cognition: Theory of mind, intent modeling, and cooperation.
LLMs pass many classic false-belief tests in text form, but show fragility under adversarial rephrasing. Humans maintain rich, persistent models of specific individuals over years.

Creativity: Novel composition across art, science, and engineering.
Generative models produce competent novel imagery, music, code, and protein structures (AlphaFold, AlphaProteo). Genuinely paradigm-shifting scientific creativity remains rare and human-led.

Self-Model: Awareness of one's own state, knowledge, and limits.
Calibration studies show frontier models partially know what they do not know, but overconfidence and hallucination persist. Humans have introspective access - itself imperfect - to ongoing thought.

Subjective Experience: Phenomenal experience - the 'what it is like'.
There is no accepted scientific test for phenomenal consciousness. Artificial systems are scored 0 to reflect the absence of evidence, not a claim of impossibility. See the 2023 Butlin et al. report on consciousness indicators in AI.

Energy Efficiency: Cognition per joule of compute.
The human brain runs on ~20 watts. Training a frontier LLM consumes tens of gigawatt-hours; inference for a single complex query can use as much energy as several minutes of human thought.
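The energy comparison can be made concrete with a little arithmetic. The sketch below takes the ~20 watt figure from the text and makes two labeled assumptions: 10 GWh as the low end of "tens of gigawatt-hours", and three minutes as one reading of "several minutes". The function names are hypothetical.

```python
BRAIN_WATTS = 20.0           # from the text: the brain runs on ~20 watts
HOURS_PER_YEAR = 24 * 365    # 8760 hours, ignoring leap years


def brain_kwh(hours: float) -> float:
    """Energy the brain uses over `hours`, in kilowatt-hours."""
    return BRAIN_WATTS * hours / 1000.0


def brain_years(training_gwh: float) -> float:
    """How many years a 20 W brain could run on one training run's energy."""
    training_kwh = training_gwh * 1_000_000  # 1 GWh = 1,000,000 kWh
    return training_kwh / brain_kwh(HOURS_PER_YEAR)


def thought_wh(minutes: float) -> float:
    """Energy of `minutes` of brain activity, in watt-hours."""
    return BRAIN_WATTS * minutes / 60.0


if __name__ == "__main__":
    print(brain_kwh(HOURS_PER_YEAR))   # brain energy per year, kWh
    print(brain_years(10))             # assumed 10 GWh training run
    print(thought_wh(3))               # assumed 3 minutes of thought
```

Under these assumptions a brain uses about 175 kWh per year, a 10 GWh training run equals roughly 57,000 brain-years of energy, and three minutes of thought is about 1 Wh, which puts the per-query claim on the order of a watt-hour.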

Frequently asked questions

What is the Intelligence Index?

The Intelligence Index is a side-by-side comparison of six cognitive systems - the human brain, the chimpanzee, frontier large language models, reasoning models, AI agents, and a hypothetical AGI - scored 0–5 across 12 capabilities: language, reasoning, mathematics, long-term memory, continual learning, multimodal perception, embodiment, social cognition, creativity, self-model, subjective experience, and energy efficiency.

How are the scores calculated?

Scores are qualitative 2026 estimates calibrated against published benchmark results (MMLU, ARC-AGI, AIME, SWE-bench, HumanEval, GPQA, FrontierMath), peer-reviewed neuroscience literature, and demonstrated real-world capability. They communicate the shape of the difference between systems rather than substitute for domain-specific evaluation.

How does a frontier LLM compare to the human brain?

Frontier LLMs match or exceed average human performance on language, knowledge recall, and many coding tasks, but lag on continual learning, embodied action, persistent long-term memory, and energy efficiency. The human brain runs on ~20 watts; training a frontier LLM consumes tens of gigawatt-hours.

What is the difference between a reasoning model and a frontier LLM?

A reasoning model (OpenAI o1/o3, DeepSeek-R1, Gemini 2.0 Flash Thinking) extends a base LLM with reinforcement learning over long chain-of-thought traces and spends additional inference compute generating and verifying internal steps. This produces large gains on verifiable tasks - o3 reaches 96.7% on AIME 2024 and 87.5% on ARC-AGI semi-private - at 10–100× the cost per query.

Is an AI agent the same as an LLM?

No. An AI agent wraps an LLM or reasoning model in a control loop that perceives an environment, plans, calls external tools, and writes to long-term memory (typically vector databases). Leading agents resolve 60%+ of real GitHub issues on SWE-bench Verified. The underlying model weights remain frozen; learning happens at the memory layer.

When will AGI be achieved?

There is no consensus. Public estimates from researchers at OpenAI, Anthropic, DeepMind, and METR cluster between 2027 and the late 2030s, with substantial disagreement. Required advances likely include continual learning without catastrophic forgetting, robust long-horizon planning, and grounded world models.

Why is subjective experience scored 0 for all artificial systems?

Because there is no accepted scientific test for phenomenal consciousness and no published evidence that any current artificial system has it. The 0 score reflects the absence of evidence, not a claim that machine consciousness is impossible. See the 2023 Butlin et al. report on consciousness indicators in AI.

Methodology

Scores are qualitative 2026 estimates based on published research, benchmark results, and demonstrated capability. They are intended to communicate the shape of difference between cognitive systems, not to substitute for domain-specific evaluation (MMLU, ARC-AGI, GPQA, HumanEval, SWE-bench, FrontierMath, etc.). Subjective experience and consciousness are scored 0 for all artificial systems - not as a claim of impossibility, but to reflect the absence of any accepted evidence today.