
Large Language Models: How They Work and Where They Fail

Large language models are transformer networks trained to predict the next token across vast text corpora. At sufficient scale, this single objective yields broad capabilities in language understanding, reasoning, and code synthesis.

12 min read · Updated April 1, 2026

By Dr. Ira S. Pastor · Editor-in-Chief · Reviewed by BrainMatter Science Review Board

Key facts

Trained on tens of trillions of tokens of text and code.
Post-training (RLHF, DPO, Constitutional AI) is now standard.
Some abilities appear to emerge only above certain compute thresholds, though part of this effect may be a measurement artifact.
Hallucination and brittle reasoning remain the central reliability issues.
Inference cost per unit capability has fallen ~10x per year since 2022.

How LLMs Are Trained

Pretraining: the model predicts the next token across trillions of tokens of web text, books, code, and curated sources. Llama 3 was pretrained on ~15 trillion tokens; frontier 2025 models are estimated in the 20–50 trillion range.
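
As a minimal sketch of this objective: the loss at each position is the negative log-probability the model assigns to the true next token, averaged over positions. The four-word vocabulary and the logits below are made up for illustration; a real model would produce the logits with a transformer over a vocabulary of tens of thousands of tokens or more.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy of the true next token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical vocabulary and made-up logits for two positions.
vocab = ["the", "cat", "sat", "mat"]
logits = [[0.1, 2.0, 0.3, 0.2],   # scores favor "cat"
          [0.0, 0.1, 3.0, 0.5]]   # scores favor "sat"
targets = [1, 2]                  # true next tokens: "cat", then "sat"
loss = next_token_loss(logits, targets)
```

A model that guessed uniformly over this vocabulary would score log(4), about 1.39; training pushes the loss below that by moving probability onto the tokens that actually follow.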

Post-training has multiple stages: supervised fine-tuning (SFT) on curated instruction-following examples, then preference optimization - RLHF (InstructGPT, 2022), DPO, or Constitutional AI / RLAIF - to align outputs with human preferences and safety guidelines.
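
To make the preference-optimization step concrete, here is a rough sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model; the numbers and the beta value are illustrative assumptions, not settings from any particular system.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written as log(1 + exp(-z))
    return math.log1p(math.exp(-z))

# Made-up log-probabilities for illustration.
loss_unchanged = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy equals reference
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy shifted toward chosen
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is how DPO optimizes preferences without training a separate reward model.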

Inference-time techniques - chain-of-thought prompting, self-consistency, tree-of-thoughts, and explicit reasoning models (OpenAI o-series, DeepSeek-R1) - trade compute at inference for better answers on hard problems.
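
Self-consistency is the simplest of these to sketch: sample several answers to the same prompt and keep the most common one. In the example below, `fake_sampler` is a hypothetical stand-in for a call to a model with sampling enabled, and the canned answers are invented.

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=5):
    """Sample several answers and return the majority answer and its vote share."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical stand-in for a stochastic model call.
_canned = iter(["42", "41", "42", "42", "40"])

def fake_sampler(prompt):
    return next(_canned)

answer, agreement = self_consistency(fake_sampler, "What is 6 * 7?")
```

The cost is explicit here: five samples means roughly five times the inference compute, spent in exchange for an answer that is less sensitive to any single bad reasoning path.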

Emergent Capabilities and Scaling

As scale grows, qualitatively new abilities appear: multi-step arithmetic, code synthesis, in-context learning, instruction following, and rudimentary multi-step reasoning. Wei et al. (2022) catalogued these 'emergent abilities.'

Schaeffer et al. (2023) argued that some apparent emergence is a measurement artifact: discontinuous metrics such as exact match can show a sudden jump where smoother metrics show continuous improvement. Evidence exists for both interpretations.
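
A toy calculation illustrates the measurement argument. It assumes, purely for illustration, that a model gets each token of a 10-token answer right independently with the same probability; real errors are not independent, so treat this as a sketch of the effect rather than a model of any benchmark.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token is correct, assuming independent errors."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improving smoothly (made-up values).
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
exact = [exact_match_rate(p, answer_length=10) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Per-token accuracy rises steadily across the list, while exact match stays near zero for the first few entries and then climbs steeply. The same underlying progress looks gradual under one metric and abrupt under the other.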

Known Limits

Hallucination - confidently generating plausible but false statements - remains the central reliability problem. Brittle reasoning on novel problems, sycophancy, vulnerability to prompt injection, and limited long-horizon planning also persist.

Mitigations include retrieval-augmented generation (RAG), tool use and function calling, verifier models, structured decoding, and explicit deliberation. None fully solve the underlying issues.
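
As a rough sketch of the retrieval-augmented pattern: find the stored passage most relevant to the question, then place it in the prompt so the model answers from supplied text rather than from memory. The word-overlap scoring below is a deliberate simplification (production systems typically use embedding similarity), and the function names and prompt wording are hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and trim basic punctuation."""
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Assemble a prompt that grounds the answer in retrieved context."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Llama 3 was pretrained on about 15 trillion tokens.",
    "RLHF fine-tunes a model against a learned reward model.",
]
prompt = build_prompt("How many tokens was Llama 3 pretrained on?", docs)
```

Grounding helps only when retrieval surfaces the right passage, which is one reason this kind of mitigation reduces hallucination without eliminating it.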

The 2026 Ecosystem

Frontier closed-weight models: OpenAI GPT-5 series, Anthropic Claude 4 series, Google Gemini 2. Frontier open-weight models: Meta Llama 4, DeepSeek-V3/R1, Qwen 3, Mistral Large 3.

Costs have fallen ~10x per year per unit of capability since 2022. Inference for GPT-3.5-class capability now costs less than one cent per million tokens; frontier reasoning models remain comparatively expensive.
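
The compounding is easy to underestimate, so here is the arithmetic as a short sketch. The starting price of $20 per million tokens is a hypothetical round number chosen for illustration, not a quoted price.

```python
def projected_cost(initial_cost, years, yearly_factor=10.0):
    """Cost after a steady decline of `yearly_factor` per year."""
    return initial_cost / (yearly_factor ** years)

# Hypothetical starting point: $20 per million tokens in 2022.
cost_2022 = 20.0
cost_2026 = projected_cost(cost_2022, years=2026 - 2022)
```

Four years at 10x per year is a 10,000x reduction, which takes the hypothetical $20 down to $0.002 per million tokens, well under one cent.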

Frequently asked

Are LLMs intelligent?

They exhibit many components of intelligence - language, broad knowledge, code generation, in-context learning - but lack consistent reasoning, grounding, and long-horizon agency by default.

Can LLMs be trusted?

Not unconditionally. Outputs require verification - especially for factual claims, code, legal or medical guidance, and high-stakes decisions. Citations and tool-grounded answers improve reliability.

What is a 'token'?

A subword unit produced by a tokenizer such as BPE or SentencePiece. One token corresponds to roughly 0.75 English words; non-Latin scripts and code typically need more tokens for the same amount of content.
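
For quick budgeting, the rule of thumb can be turned into a rough estimator. This is only an approximation for English prose; an exact count requires running the specific model's tokenizer.

```python
import math

# Rule of thumb: about 0.75 English words per token.
WORDS_PER_TOKEN = 0.75

def estimate_tokens(text):
    """Rough token estimate from a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)

estimate = estimate_tokens("Large language models predict the next token")
```

The seven-word sentence above comes out at about ten tokens under this estimate. Code and non-English text will usually exceed it.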

What is RLHF?

Reinforcement Learning from Human Feedback: humans compare pairs of model outputs, a reward model learns to predict their preference, and the LLM is fine-tuned (typically via PPO) to maximize predicted reward.
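
The reward-model step can be sketched with the pairwise (Bradley-Terry) formulation commonly used for this purpose: the probability that the chosen output is preferred is a sigmoid of the difference in reward scores. The scores below are invented; in practice they come from a neural network that reads the prompt and response.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen output is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(comparisons):
    """Mean negative log-likelihood of the human choices."""
    total = sum(-math.log(preference_probability(c, r)) for c, r in comparisons)
    return total / len(comparisons)

# Made-up (chosen, rejected) reward scores for three human comparisons.
untrained = reward_model_loss([(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)])
trained = reward_model_loss([(1.5, -0.5), (0.8, 0.1), (2.0, -1.0)])
```

Training the reward model lowers this loss by scoring human-preferred outputs higher; the LLM is then fine-tuned to produce outputs that the trained reward model scores highly.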
