
Reinforcement Learning: From AlphaGo to RLHF
Reinforcement learning (RL) trains agents to maximize cumulative reward through trial and error. It is the framework behind AlphaGo, AlphaZero, OpenAI Five, RLHF, and reasoning models such as OpenAI's o-series and DeepSeek-R1.
Key facts
- Formalized as Markov Decision Processes (Bellman, 1957).
- DQN (2013) and AlphaGo (2016) were the field's public turning points.
- RLHF brought RL to language model alignment, reaching wide deployment with InstructGPT in 2022.
- Reasoning models use RL on verifiable rewards (math, code) to train chain-of-thought.
- Sample efficiency and reward specification remain the hardest open problems.
The MDP Framework
RL formalizes sequential decisions as Markov Decision Processes: a set of states, actions, transition dynamics, and a reward function. The agent learns a policy — a mapping from states to actions — that maximizes the expected discounted sum of future rewards.
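To make the definition concrete, here is a minimal sketch of a two-state MDP, its discounted return, and value iteration, a classic dynamic-programming method for finding a policy when the dynamics are known. The states, actions, probabilities and rewards are invented for illustration and do not come from any benchmark.

```python
# Hypothetical two-state MDP (all numbers are illustrative assumptions).
# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    "cold": {"wait": [(1.0, "cold", 0.0)],
             "heat": [(0.8, "warm", 1.0), (0.2, "cold", 0.0)]},
    "warm": {"wait": [(0.9, "warm", 1.0), (0.1, "cold", 0.0)],
             "heat": [(1.0, "warm", 0.5)]},
}

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t for one episode's reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def action_value(outcomes, values, gamma):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)

def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            best = max(action_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The policy maps each state to the action with the highest value.
    policy = {s: max(acts, key=lambda a: action_value(acts[a], values, gamma))
              for s, acts in transitions.items()}
    return values, policy

values, policy = value_iteration(TRANSITIONS)
print(policy)
```

Value iteration needs the transition probabilities up front. The RL algorithms discussed in the rest of this article have to estimate good behavior from sampled experience instead.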
Core algorithmic families: value-based (Q-learning, DQN), policy-gradient (REINFORCE, PPO, GRPO), actor-critic hybrids (A3C, SAC), and model-based methods (Dreamer, MuZero) that learn a model of the environment.
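As one concrete instance of the value-based family, the sketch below runs tabular Q-learning on a hypothetical five-cell corridor where the only reward is for reaching the rightmost cell. The environment, the hyperparameters and the function name are assumptions chosen to keep the example small. DQN replaces the table with a neural network but keeps the same update target.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment).

    States are 0..n_states-1, actions are 0 (left) and 1 (right),
    and reaching the last state ends the episode with reward 1.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy action choice, with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Bootstrapped target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
greedy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print(greedy)
```

Policy-gradient methods take a different route: they adjust the policy's parameters directly rather than reading the policy off a value table.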
Milestones
TD-Gammon (Tesauro, 1992) mastered backgammon with a neural network and temporal-difference learning. DQN (DeepMind, 2013) learned to play Atari games directly from raw pixels, and its 2015 follow-up reached human-level performance on many of the games tested.
AlphaGo (2016) defeated world champion Lee Sedol; AlphaZero (2017) generalized to chess and shogi via pure self-play. AlphaStar and OpenAI Five (2018–19) mastered StarCraft II and Dota 2.
RLHF (Christiano et al., 2017; Ouyang et al., 2022) brought RL to language model alignment. Reasoning models (o1, o3, DeepSeek-R1, 2024–2025) use RL on verifiable rewards to train explicit chain-of-thought.
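The idea of "RL on verifiable rewards" can be sketched in a few lines: a program, not a learned model, checks each sampled answer, and the resulting scores are compared within a group of samples for the same prompt. The answer format, the exact-match check and the normalization below are assumptions for illustration, loosely following the group-relative advantage used by GRPO. They are not taken from any lab's published training pipeline.

```python
import statistics

def exact_match_reward(completion, reference, marker="Answer:"):
    """Hypothetical verifier: 1.0 if the text after the last marker
    equals the reference answer, otherwise 0.0."""
    if marker not in completion:
        return 0.0
    answer = completion.rsplit(marker, 1)[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample relative to the other samples for the same prompt:
    subtract the group mean, then divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four made-up completions for the prompt "What is 2 + 2?"
samples = [
    "2 plus 2 is 4. Answer: 4",
    "I think it is five. Answer: 5",
    "Answer: 4",
    "I am not sure.",
]
rewards = [exact_match_reward(s, "4") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Samples with positive advantage would have their token probabilities pushed up during the policy update, and samples with negative advantage pushed down.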
Persistent Challenges
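Sample efficiency: deep RL often needs millions of environment interactions, which is rarely feasible when each interaction means acting in the physical world. Reward specification: optimizing an imperfect proxy produces 'reward hacking', where the agent scores well without doing the intended task. Balancing exploration against exploitation, and deploying safely while the agent is still learning, also remain hard.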
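The exploration versus exploitation trade-off shows up even in the simplest setting, a multi-armed bandit. The sketch below uses an epsilon-greedy rule. The arm means, noise level and step count are made-up values, and the function name is hypothetical.

```python
import random

def run_bandit(true_means, epsilon, steps=2000, seed=0):
    """Epsilon-greedy on a Gaussian bandit (illustrative assumptions only).

    With probability epsilon pull a random arm (explore); otherwise pull
    the arm with the highest estimated mean so far (exploit).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental running mean of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps, counts

for eps in (0.0, 0.1, 1.0):
    average_reward, pulls = run_bandit([0.1, 0.5, 0.9], epsilon=eps)
    print(eps, round(average_reward, 3), pulls)
```

Setting epsilon to 0 risks locking onto whichever arm looked good first, while setting it to 1 never uses what has been learned. Practical methods sit between the two and often reduce exploration over time.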
Offline RL (learning from logged data), model-based RL, and inverse RL (inferring rewards from demonstrations) address some of these limitations.
RLHF and Modern LLM Alignment
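In RLHF, humans compare pairs of model outputs, a reward model is trained on these preferences, and the LLM is fine-tuned with PPO to maximize the predicted reward. Simpler methods such as DPO skip the separate reward model and the RL loop, and optimize the LLM directly on the preference pairs.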
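The two training signals can be written as per-pair losses. The sketch below shows the Bradley-Terry loss commonly used to fit a reward model on preference pairs, and the DPO loss, which applies the same form to log-probability ratios between the policy and a frozen reference model. It works on plain floats for clarity; the function names and the beta value are assumptions, and a real implementation would compute these over batches in a deep learning framework.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: low when the reward model scores the
    human-preferred output above the rejected one."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on one preference pair. Inputs are sequence
    log-probabilities under the policy and a frozen reference model."""
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

# A reward model that agrees with the human label gets a lower loss.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
# A policy identical to the reference starts at log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

In the DPO case no reward model appears: the loss falls as the policy raises the probability of the preferred output, relative to the reference, more than it raises the rejected one.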
Constitutional AI (Anthropic) and RLAIF replace much of the human labeling with model-generated feedback judged against a written set of principles, which substantially reduces labeling cost.
Frequently asked
How does RLHF work?
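A reward model is trained on human preference comparisons between pairs of LLM outputs. The LLM is then fine-tuned, typically via PPO, to maximize predicted reward. DPO is a simpler alternative that optimizes the LLM on the preference pairs directly, without a separate reward model.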
Is RL dangerous?
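Reward misspecification ('reward hacking') is a recognized risk: optimizing the wrong proxy can produce surprising and undesirable behavior. The boat-racing game CoastRunners is a classic illustration: an agent rewarded for hitting targets learned to circle and re-collect them instead of finishing the race.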
What is self-play?
An RL setup where an agent improves by playing against copies of itself. AlphaZero used pure self-play with no human game data to master chess, shogi, and Go.
