AI Safety & Alignment
Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. · 2022 · OpenAI / NeurIPS
Introduced InstructGPT and popularized the three-stage RLHF (reinforcement learning from human feedback) pipeline that is now standard for aligning LLMs with human intent.
Research objective
Make large language models more helpful, honest, and harmless by training them on human preferences.
Methodology
Three-stage pipeline: (1) supervised fine-tuning on human demonstrations, (2) training a reward model from human preference comparisons, (3) reinforcement learning against the reward model using proximal policy optimization (PPO).
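Stages (2) and (3) can be made concrete with a small sketch. The Python below is illustrative only, not the authors' implementation: it shows a pairwise (Bradley-Terry style) preference loss of the kind commonly used to train reward models, and a KL-penalized reward of the kind commonly fed to PPO. The function names, the scalar inputs, and the default `beta` value are all assumptions made for this example.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Stage 2 sketch: -log(sigmoid(r_chosen - r_rejected)) for one comparison.

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it gets the order wrong.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


def kl_penalized_reward(
    reward: float,
    logprob_policy: float,
    logprob_sft: float,
    beta: float = 0.02,  # illustrative value, an assumption of this sketch
) -> float:
    """Stage 3 sketch: reward-model score minus a per-sample KL-style penalty.

    The penalty discourages the RL policy from drifting far from the
    supervised fine-tuned (SFT) model while it chases reward.
    """
    return reward - beta * (logprob_policy - logprob_sft)


if __name__ == "__main__":
    # Hypothetical reward-model scores for two labeled comparisons.
    comparisons = [(1.2, 0.3), (0.1, 0.4)]
    print("mean reward-model loss:", mean_preference_loss(comparisons))
    print("shaped reward:", kl_penalized_reward(1.0, logprob_policy=-1.0, logprob_sft=-1.5))
```

In a real system the rewards and log-probabilities would come from neural networks and the loss would be minimized by gradient descent; plain floats are used here only to keep the arithmetic visible.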
Key findings
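- Human raters preferred outputs from a 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, despite the roughly 100x difference in size.
- RLHF improved truthfulness and modestly reduced toxic outputs; it did not significantly reduce bias.
- The alignment tax (performance regressions on public NLP benchmarks) was modest, and mixing pretraining updates into the RL stage reduced it further.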
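The trade-off behind the alignment tax is easiest to see in the RL-stage objective. The form below is a reconstruction in the paper's notation and should be checked against the original: r_θ is the reward model, π^RL_φ the policy being trained, π^SFT the supervised fine-tuned model, β the KL penalty coefficient, and γ the weight on the pretraining term. Setting γ = 0 gives plain PPO; γ > 0 gives the variant the paper calls PPO-ptx.

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi^{\mathrm{RL}}_{\phi}}}
  \left[ r_{\theta}(x,y)
         - \beta \log \frac{\pi^{\mathrm{RL}}_{\phi}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi^{\mathrm{RL}}_{\phi}(x) \right]
```

The first expectation rewards outputs the reward model scores highly while penalizing drift from the SFT model; the second keeps the policy's likelihood on pretraining text high, which is what limits regressions on academic benchmarks.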
Strengths
- Practical, scalable alignment method that became the industry standard.
- Made LLMs usable as general-purpose assistants.
Limitations
- Reward models can inherit and amplify rater biases.
- Susceptible to sycophancy and reward hacking.
- Does not solve deceptive alignment for more capable systems.
Practical implications
- RLHF underpins ChatGPT, Claude, Gemini, and most deployed assistants.
- Motivated subsequent work on Constitutional AI, direct preference optimization (DPO), and scalable oversight.
