AI Safety & Alignment
Constitutional AI: Harmlessness from AI Feedback
Bai et al. · 2022 · Anthropic
Trained a helpful, harmless assistant using AI-generated critiques guided by a written constitution.
Research objective
Reduce reliance on human labels for harmlessness training, and make model values explicit and auditable.
Methodology
Used a base model to critique and revise its own outputs against a list of natural-language principles ('the constitution'). Then performed RL using preferences generated by an AI evaluator rather than humans (RLAIF).
Key findings
- Models trained with Constitutional AI were as harmless as RLHF models with far less human labeling.
- Made the value system inspectable and editable.
- Reduced evasive non-answers compared to early RLHF systems.
Strengths
- Scales harmlessness training without proportional human-label cost.
- Transparent value specification.
Limitations
- Quality bounded by the critiquing model's own judgment.
- Constitution authoring is itself a values-laden choice.
Practical implications
- Foundational technique behind Claude.
- Influential template for explicit, document-driven alignment.
