AI Safety & Alignment
Concrete Problems in AI Safety
Amodei, Olah, Steinhardt, Christiano, Schulman, Mané · 2016 · arXiv
Foundational taxonomy of practical safety problems in modern ML systems.
Research objective
Translate abstract AI-risk concerns into concrete, tractable research problems.
Methodology
Conceptual analysis with worked examples. Identified five categories: avoiding negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.
Key findings
- Many alignment problems can be studied in current ML systems, not only future AGI.
- Reward specification is brittle and prone to gaming.
- Safe exploration is critical for deployed agents.
Strengths
- Made AI safety legible to mainstream ML researchers.
- Catalyzed a generation of empirical alignment work.
Limitations
- Focused on RL-style agents; less direct mapping to modern LLMs.
- Did not anticipate the central role of language-model alignment.
Practical implications
- Foundational reading for the alignment field.
- Many co-authors later founded Anthropic and shaped industry safety practice.
Related entities
Related research
Training Language Models to Follow Instructions with Human Feedback
Introduced InstructGPT and the now-standard RLHF pipeline for aligning LLMs with human intent.
Read summary
Constitutional AI: Harmlessness from AI Feedback
Trained a helpful, harmless assistant using AI-generated critiques guided by a written constitution.
Read summary
