AI Safety & Alignment

Concrete Problems in AI Safety

Amodei, Olah, Steinhardt, Christiano, Schulman, Mané · 2016 · arXiv

Foundational taxonomy of practical safety problems in modern ML systems.

Research objective

Translate abstract AI-risk concerns into concrete, tractable research problems.

Methodology

Conceptual analysis with worked examples. Identified five categories: avoiding negative side effects, reward hacking, scalable oversight, safe exploration, and robustness to distributional shift.

Key findings

Many alignment problems can be studied in current ML systems, not only future AGI.
Reward specification is brittle and prone to gaming.
Safe exploration is critical for deployed agents.

Strengths

Made AI safety legible to mainstream ML researchers.
Catalyzed a generation of empirical alignment work.

Limitations

Focused on RL-style agents; less direct mapping to modern LLMs.
Did not anticipate the central role of language-model alignment.

Practical implications

Foundational reading for the alignment field.
Many co-authors later founded Anthropic and shaped industry safety practice.

Read the original paper

Related entities

Scientist · dario-amodei Atlas · agi Atlas · superintelligence Glossary · alignment Glossary · reward-hacking

Related research

Training Language Models to Follow Instructions with Human Feedback

Introduced InstructGPT and the now-standard RLHF pipeline for aligning LLMs with human intent.

Read summary

Constitutional AI: Harmlessness from AI Feedback

Trained a helpful, harmless assistant using AI-generated critiques guided by a written constitution.

Read summary