AI Safety & Alignment

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al. · 2022 · OpenAI / NeurIPS

Introduced InstructGPT and the now-standard RLHF pipeline for aligning LLMs with human intent.

Research objective

Make large language models more helpful, honest, and harmless by training them on human preferences.

Methodology

Three-stage pipeline: (1) supervised fine-tuning (SFT) on human demonstrations, (2) training a reward model from human preference comparisons between model outputs, (3) reinforcement learning against the reward model using Proximal Policy Optimization (PPO).
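
Stages (2) and (3) can be illustrated with a minimal sketch. It assumes the reward model is trained with a pairwise logistic loss on (preferred, rejected) comparisons, and that the RL stage subtracts a KL-style penalty to keep the policy close to the SFT model. The function names and the beta value are hypothetical, chosen for illustration, and are not taken from the paper's code.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output above the rejected one, and large when it ranks them the wrong way.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logprob_policy: float,
    logprob_reference: float,
    beta: float = 0.02,  # illustrative value, an assumption for this sketch
) -> float:
    """Reward signal for the RL stage, for one sampled response.

    Subtracts beta * (log pi_policy - log pi_reference), a per-sample
    estimate of the KL divergence from the SFT reference model, so the
    policy is discouraged from drifting far from it.
    """
    return reward - beta * (logprob_policy - logprob_reference)


if __name__ == "__main__":
    # Reward model agrees with the human rater: low loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # Reward model disagrees with the human rater: high loss.
    print(pairwise_preference_loss(0.0, 2.0))
    # Policy assigns a higher log-probability than the reference: reward is reduced.
    print(kl_penalized_reward(1.0, logprob_policy=-0.5, logprob_reference=-1.0))
```

In practice both quantities are computed over batches of token sequences with a neural network producing the scores; the scalar version here only shows the shape of the objective.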

Key findings

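  • Human raters preferred outputs from the 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, despite it having over 100x fewer parameters.
  • RLHF improved truthfulness and reduced toxic output generation relative to GPT-3.
  • The alignment tax (performance regressions on public NLP benchmarks) was modest.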

Strengths

  • Practical, scalable alignment method that became an industry standard.
  • Made LLMs usable as general-purpose assistants.

Limitations

  • Reward models inherit, and can amplify, the biases of the human raters whose preferences they are trained on.
  • Susceptible to sycophancy and reward hacking.
  • Does not solve deceptive alignment for more capable systems.

Practical implications

  • RLHF, or closely related preference-based training, underpins ChatGPT, Claude, Gemini, and most deployed assistants.
  • Motivated subsequent work on constitutional AI, DPO, and scalable oversight.
