
Large Language Models: How They Work and Where They Fail
Large language models are transformer networks trained to predict the next token in vast text corpora. At sufficient scale, this single objective produces broadly general abilities in language understanding, reasoning, and code synthesis.
Key facts
- Trained on tens of trillions of tokens of text and code.
- Post-training (RLHF, DPO, Constitutional AI) is now standard.
- Emergent abilities appear above certain compute thresholds, though whether the jumps are real or a measurement artifact is debated.
- Hallucination and brittle reasoning remain the central reliability issues.
- Inference cost per unit capability has fallen ~10x per year since 2022.
How LLMs Are Trained
Pretraining: the model predicts the next token across trillions of tokens of web text, books, code, and curated sources. Llama 3 was pretrained on ~15 trillion tokens; frontier 2025 models are estimated in the 20–50 trillion range.
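The next-token objective can be illustrated with a minimal sketch. It assumes the standard cross-entropy formulation (mean negative log-likelihood of the true next token); the toy vocabulary of four tokens, the logits, and the function names are invented for illustration and do not come from any particular model.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean negative log-likelihood of the true next token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy data: 4-token vocabulary, two positions.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
targets = [0, 2]

loss = next_token_loss(logits, targets)
# A model with no information (all logits equal) scores ln(4) on this vocabulary.
uniform_loss = next_token_loss([[0.0] * 4, [0.0] * 4], targets)
print(f"loss={loss:.3f}  uniform baseline={uniform_loss:.3f}")
```

Pretraining amounts to lowering this quantity, averaged over the whole corpus, by adjusting the network weights that produce the logits.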
Post-training has multiple stages: supervised fine-tuning (SFT) on curated instruction-following examples, then preference optimization — RLHF (InstructGPT, 2022), DPO, or Constitutional AI / RLAIF — to align outputs with human preferences and safety guidelines.
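As one concrete instance of preference optimization, the per-example DPO loss can be sketched as follows. It assumes the usual formulation: the policy is rewarded for raising the log-probability of the preferred response, relative to a frozen reference model, more than that of the rejected one. The log-probability values and the default `beta=0.1` are illustrative assumptions, not values from the source.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); guard against overflow for very negative z.
    return math.log1p(math.exp(-z)) if z > -30 else -z

# Policy identical to the reference: no preference learned yet, loss is ln(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(f"start={start:.3f}  improved={improved:.3f}")
```

Unlike RLHF, this loss needs no separate reward model or sampling loop; it is computed directly on pairs of preferred and rejected responses.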
Inference-time techniques — chain-of-thought prompting, self-consistency, tree-of-thoughts, and explicit reasoning models (OpenAI o-series, DeepSeek-R1) — trade compute at inference for better answers on hard problems.
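Self-consistency is the simplest of these to sketch: sample several reasoning chains, keep only each chain's final answer, and return the most common one. In the sketch below, `canned_solver` is a hypothetical stand-in for a sampled model call; a real system would query a model with nonzero temperature instead.

```python
from collections import Counter

def self_consistency(sample_answer, question, n_samples):
    """Sample n final answers and return the majority answer and its vote share."""
    answers = [sample_answer(question, i) for i in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical stand-in for sampled chain-of-thought runs: fixed final answers.
CANNED = ["42", "41", "42", "42", "43"]

def canned_solver(question, sample_index):
    return CANNED[sample_index % len(CANNED)]

answer, agreement = self_consistency(canned_solver, "What is 6 * 7?", n_samples=5)
print(answer, agreement)
```

The cost is linear in the number of samples, which is the compute-for-accuracy trade described above. The vote share can also serve as a rough confidence signal.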
Emergent Capabilities and Scaling
As scale grows, qualitatively new abilities appear: multi-step arithmetic, code synthesis, in-context learning, instruction following, and rudimentary multi-step reasoning. Wei et al. (2022) catalogued these 'emergent abilities.'
Schaeffer et al. (2023) argued that some apparent emergence is a measurement artifact of discontinuous metrics such as exact match, and that smoother metrics show continuous improvement. Both interpretations have supporting evidence.
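The measurement-artifact argument can be seen in a small numerical sketch. It assumes, purely for illustration, that token errors are independent, so an answer of `answer_length` tokens is exactly right with probability equal to per-token accuracy raised to that length. The accuracy values are invented, not measured.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is correct (independence assumed)."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for a sequence of increasingly large models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
exact = [exact_match_rate(p, 10) for p in per_token]

for p, e in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> 10-token exact match {e:.3f}")
```

Per-token accuracy rises steadily across the list, while exact match stays near zero for the first several entries and then climbs steeply. The same underlying improvement looks smooth under one metric and abrupt under the other.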
Known Limits
Hallucination — confidently generating plausible but false statements — remains the central reliability problem. Brittle reasoning on novel problems, sycophancy, prompt injection vulnerability, and limited long-horizon planning persist.
Mitigations include retrieval-augmented generation (RAG), tool use and function calling, verifier models, structured decoding, and explicit deliberation. None fully solve the underlying issues.
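The retrieval and prompt-assembly half of RAG can be sketched in a few lines. This toy version ranks passages by word overlap with the query, which is an assumption made for brevity; production systems typically use embedding similarity or a search index. The passages, function names, and prompt wording are all hypothetical, and the generation step (sending the prompt to a model) is left out.

```python
def tokenize(text):
    """Lowercase and split into a set of words, dropping basic punctuation."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Place retrieved passages ahead of the question so the model can ground its answer."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Llama 3 was pretrained on about 15 trillion tokens.",
    "RLHF fine-tunes a model against a learned reward model.",
    "BPE is a subword tokenization algorithm.",
]
question = "How many tokens was Llama 3 pretrained on?"
top = retrieve(question, docs)
prompt = build_prompt(question, top)
print(prompt)
```

Grounding the prompt this way reduces, but does not eliminate, hallucination: the model can still misread or ignore the supplied context.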
The 2026 Ecosystem
Frontier closed-weight models: OpenAI GPT-5 series, Anthropic Claude 4 series, Google Gemini 2. Frontier open-weight models: Meta Llama 4, DeepSeek-V3/R1, Qwen 3, Mistral Large 3.
Cost per unit of capability has fallen roughly 10x per year since 2022. Inference for GPT-3.5-class capability now costs under a cent per million tokens; frontier reasoning models remain comparatively expensive.
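To make the compounding concrete, the sketch below applies the article's stated rate of roughly 10x per year as a constant factor. The constant rate is a simplifying assumption; real prices move in steps and differ by provider.

```python
def relative_cost(years_elapsed, annual_decline_factor=10.0):
    """Cost relative to the starting year, assuming a constant yearly decline."""
    return annual_decline_factor ** (-years_elapsed)

# Relative cost of a fixed capability level, normalized to 1.0 in 2022.
for year in range(2022, 2027):
    print(year, relative_cost(year - 2022))
```

Under this assumption, four years of decline compound to a factor of about 10,000.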
Frequently asked
Are LLMs intelligent?
They exhibit many components of intelligence — language, broad knowledge, code generation, in-context learning — but lack consistent reasoning, grounding, and long-horizon agency by default.
Can LLMs be trusted?
Not unconditionally. Outputs require verification — especially for factual claims, code, legal or medical guidance, and high-stakes decisions. Citations and tool-grounded answers improve reliability.
What is a 'token'?
A subword unit produced by a tokenizer such as BPE or SentencePiece. One token is roughly 0.75 English words; non-Latin scripts and code typically need more tokens for the same amount of content.
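For quick budgeting, the 0.75 words-per-token rule of thumb can be turned into a rough estimator. This is only a heuristic for English prose; an exact count requires running the model's actual tokenizer. The function name is hypothetical.

```python
import math

def estimate_tokens(text, words_per_token=0.75):
    """Rough token count for English text from a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / words_per_token)

sample = "The quick brown fox jumps over the lazy dog"
print(estimate_tokens(sample))  # 9 words / 0.75 -> 12
```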
What is RLHF?
Reinforcement Learning from Human Feedback: humans compare pairs of model outputs, a reward model learns to predict their preference, and the LLM is fine-tuned (typically via PPO) to maximize predicted reward.
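The reward-model step can be sketched with a pairwise loss. The sketch assumes the Bradley-Terry form commonly used for this purpose, in which the probability that one output is preferred is the sigmoid of the difference in scalar rewards. The reward values below are invented for illustration, and the PPO stage is not shown.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen output beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's observed preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Hypothetical scalar rewards assigned to two outputs for the same prompt.
undecided = reward_model_loss(0.0, 0.0)   # model cannot tell them apart
agrees = reward_model_loss(2.0, -1.0)     # model ranks the human-preferred output higher
print(f"undecided={undecided:.3f}  agrees={agrees:.3f}")
```

Training the reward model minimizes this loss over many human comparisons; the fine-tuning stage then optimizes the LLM against the resulting reward.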
