Learning from Reward

Reinforcement Learning: From AlphaGo to RLHF

Reinforcement learning (RL) trains agents to maximize cumulative reward through trial and error. It is the framework behind AlphaGo, AlphaZero, OpenAI Five, RLHF, and reasoning models such as OpenAI's o-series and DeepSeek-R1.

Updated April 5, 2026

By Dr. Ira S. Pastor · Editor-in-Chief · Reviewed by BrainMatter Science Review Board

Key facts

Formalized as Markov Decision Processes (Bellman, 1957).
DQN (2013) and AlphaGo (2016) were the field's public turning points.
RLHF, introduced in 2017, became central to language model alignment in 2022.
Reasoning models use RL on verifiable rewards (math, code) to train chain-of-thought.
Sample efficiency and reward specification remain the hardest open problems.

The MDP Framework

RL formalizes sequential decision-making as a Markov Decision Process (MDP): a set of states, a set of actions, transition dynamics, and a reward function. The agent learns a policy - a mapping from states to actions - that maximizes the expected discounted sum of future rewards, where a discount factor between 0 and 1 weights near-term rewards more heavily than distant ones.
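The "discounted sum of future rewards" can be made concrete with a few lines of Python. This is a minimal sketch: the function name `discounted_return` and the example reward sequence are illustrative assumptions, not part of any particular library.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t].

    Accumulating from the last reward backwards avoids computing
    explicit powers of gamma.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example)  # 1.5
```

A smaller gamma makes the agent more short-sighted; a gamma close to 1 makes rewards far in the future count almost as much as immediate ones.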

Core algorithmic families: value-based (Q-learning, DQN), policy-gradient (REINFORCE, PPO, GRPO), actor-critic hybrids (A3C, SAC), and model-based methods (Dreamer, MuZero) that learn a model of the environment.
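As an illustration of the value-based family, the sketch below runs tabular Q-learning on a toy five-state chain. The environment, the hyperparameters, and every name in it (`step`, `q_learning`, `N_STATES`) are assumptions made up for this example; DQN replaces the table with a neural network but keeps the same update target.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics: reward 1.0 only on reaching the last state."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so early, mostly random episodes end
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)  # explore, or break ties randomly
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of next state
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # action chosen in each non-terminal state
```

Because the only reward sits at the right end of the chain, the learned greedy policy should move right from every non-terminal state.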

Milestones

TD-Gammon (Tesauro, 1992) mastered backgammon with a neural network and temporal-difference learning. DQN (DeepMind, 2013; extended in 2015) learned to play Atari games directly from raw pixels, reaching human-level performance on many of them.

AlphaGo (2016) defeated world champion Lee Sedol; AlphaZero (2017) generalized to chess and shogi via pure self-play. AlphaStar and OpenAI Five (2018–19) mastered StarCraft II and Dota 2.

RLHF (Christiano et al., 2017; Ouyang et al., 2022) brought RL to language model alignment. Reasoning models (o1, o3, DeepSeek-R1, 2024–2025) use RL on verifiable rewards to train explicit chain-of-thought.

Persistent Challenges

Sample efficiency: deep RL often needs millions of environment interactions, which is rarely feasible outside simulation. Reward specification: optimizing an imperfect proxy produces 'reward hacking,' where the agent scores well without doing the intended task. Balancing exploration against exploitation, and deploying safely while the agent is still learning, also remain hard.
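The exploration-exploitation trade-off is easiest to see in a multi-armed bandit. The sketch below uses epsilon-greedy action selection, one simple strategy among many; the arm probabilities, step count, and function name `run_epsilon_greedy` are assumptions chosen for illustration.

```python
import random


def run_epsilon_greedy(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit, exploring with probability epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    estimates = [0.0] * len(arm_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))  # explore: pick a random arm
        else:
            # exploit: pick the arm with the best estimate so far
            arm = max(range(len(arm_probs)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        # incremental running mean of the rewards observed for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical payout probabilities; the third arm is the best one.
counts, estimates = run_epsilon_greedy([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print(counts, best_arm)
```

With epsilon set to 0 the agent can lock onto whichever arm paid out first; with epsilon set to 1 it never uses what it has learned. Values in between trade some reward now for better estimates later.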

Offline RL (learning from logged data), model-based RL, and inverse RL (inferring rewards from demonstrations) address some of these limitations.

RLHF and Modern LLM Alignment

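In RLHF, humans compare pairs of model outputs, and a reward model is trained on these preferences. The LLM is then fine-tuned to maximize the predicted reward, typically with PPO. Simpler alternatives such as DPO skip the separate reward model and optimize the LLM directly on the preference pairs.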
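Training a reward model on pairwise comparisons is commonly framed with a Bradley-Terry model, in which the probability that the preferred output wins is the sigmoid of the difference in reward scores. The sketch below assumes that formulation; the function name `preference_loss` and the scalar scores are illustrative, and a real reward model would produce the scores from a neural network.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one
    under a Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores: the loss is small when the chosen output scores higher.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

Minimizing this loss over many labeled pairs pushes the reward model to score human-preferred outputs above rejected ones; only the difference between the two scores matters, not their absolute values.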

Constitutional AI (Anthropic) and RLAIF replace much of the human labeling with model-generated feedback judged against a written set of principles, substantially reducing labeling cost.

Frequently asked

How does RLHF work?

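A reward model is trained on human preference comparisons between pairs of LLM outputs. The LLM is then fine-tuned, typically via PPO, to maximize the predicted reward. DPO is a simpler alternative that trains on the preference pairs directly, without a separate reward model.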

Is RL dangerous?

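Reward misspecification ('reward hacking') is a recognized risk - optimizing the wrong proxy can produce surprising and undesirable behavior. The boat-racing game CoastRunners is a classic illustration: an agent rewarded for hitting targets learned to circle and re-collect them instead of finishing the race.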

What is self-play?

An RL setup where an agent improves by playing against copies of itself. AlphaZero used pure self-play with no human game data to master chess, shogi, and Go.
