AI Safety & Alignment

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al. · 2022 · OpenAI / NeurIPS

Introduced InstructGPT and the now-standard RLHF pipeline for aligning LLMs with human intent.

Research objective

Make large language models more helpful, honest, and harmless by training them on human preferences.

Methodology

Three-stage pipeline: (1) supervised fine-tuning on human demonstrations, (2) training a reward model from human preference comparisons, (3) reinforcement learning against the reward model using PPO.
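Stages (2) and (3) can be made concrete with a minimal sketch. It assumes the reward model is trained with a pairwise logistic loss on (preferred, rejected) responses, and that the RL stage subtracts a KL-style penalty against the supervised model from the reward model's score. The function names, argument names, and the default beta are illustrative assumptions, not values or code from the paper.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 sketch: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the reward model scores the human-preferred
    response higher than the rejected one.
    """
    diff = r_chosen - r_rejected
    # Numerically stable softplus(-diff) = log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(
    rm_score: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.02,  # illustrative penalty weight (assumption)
) -> float:
    """Stage 3 sketch: reward model score minus beta * log(pi_RL / pi_SFT).

    The penalty discourages the RL policy from drifting far from the
    supervised fine-tuned model while it chases reward.
    """
    return rm_score - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # The reward model agrees with the rater: small loss.
    print(pairwise_reward_loss(2.0, 0.0))
    # The reward model disagrees with the rater: larger loss.
    print(pairwise_reward_loss(0.0, 2.0))
    # The policy assigns higher log-probability than the SFT model,
    # so the shaped reward falls below the raw score.
    print(kl_penalized_reward(1.0, logp_policy=-5.0, logp_sft=-6.0, beta=0.1))
```

In a real pipeline the scalar inputs would be per-sample tensors from the reward model and the two language models, and PPO would optimize the policy against the shaped reward.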

Key findings

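Outputs from the 1.3B-parameter InstructGPT model were preferred by human raters to outputs from the 175B-parameter GPT-3, despite over 100x fewer parameters.
RLHF improved truthfulness and produced small reductions in toxic outputs, though not in bias.
The alignment tax, meaning performance regressions on public NLP benchmarks, was modest.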
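The paper limits the alignment tax by mixing pretraining gradients into the PPO updates, a variant it calls PPO-ptx. The combined objective takes roughly the form below. The notation is a reconstruction: r_θ is the reward model, π^SFT the supervised model, π_φ^RL the policy being trained, β the KL penalty weight, and γ the pretraining mix coefficient.

```latex
\[
\mathrm{objective}(\phi) =
\mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
\left[ r_\theta(x,y)
  - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
+ \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
\left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
\]
```

Setting γ = 0 recovers plain KL-penalized PPO. A positive γ adds a language-modeling term on pretraining data that pulls the policy back toward its original capabilities.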

Strengths

Practical, scalable alignment method that became industry standard.
Made LLMs usable as general-purpose assistants.

Limitations

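Reward models can inherit and amplify the biases of the human raters who label the comparisons.
The trained policy is susceptible to sycophancy and to reward hacking, where it exploits flaws in the reward model instead of improving its answers.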
Does not solve deceptive alignment for more capable systems.

Practical implications

RLHF, or a close variant of it, underpins ChatGPT, Claude, Gemini, and most other deployed assistants.
Motivated subsequent work on constitutional AI, DPO, and scalable oversight.

Related research

Concrete Problems in AI Safety

Foundational taxonomy of practical safety problems in modern ML systems.

Constitutional AI: Harmlessness from AI Feedback

Trained a helpful, harmless assistant using AI-generated critiques guided by a written constitution.
