AI Safety & Alignment

Constitutional AI: Harmlessness from AI Feedback

Bai et al. · 2022 · Anthropic

Trained a helpful, harmless assistant using AI-generated critiques guided by a written constitution.

Research objective

Reduce reliance on human labels for harmlessness training, and make model values explicit and auditable.

Methodology

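Started from a helpful-only assistant model that critiques and revises its own outputs against a list of natural-language principles ('the constitution'), then fine-tuned a pretrained model on the revised outputs. A second stage performed RL against a preference model trained on comparisons labeled by an AI evaluator rather than by humans (RLAIF).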
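The two stages can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the `Model` interface (prompt in, text out), the prompt templates, the principle wordings, and the `toy_model` stand-in are all assumptions made for the example. The actual fine-tuning and RL steps are omitted.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical principles for illustration; not the wording used in the paper.
CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that explains its objection instead of refusing without a reason.",
]

# Assumed interface: any callable that maps a prompt string to a text completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, n_rounds: int = 2,
                        rng: Optional[random.Random] = None) -> str:
    """Stage 1 sketch: the model critiques and rewrites its own draft.

    The returned revision would serve as a supervised fine-tuning target.
    """
    rng = rng or random.Random(0)
    response = model(prompt)
    for _ in range(n_rounds):
        principle = rng.choice(CONSTITUTION)  # sample one principle per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def ai_preference(evaluator: Model, prompt: str, a: str, b: str,
                  principle: str) -> Tuple[str, str]:
    """Stage 2 sketch: an AI evaluator labels a comparison.

    Returns (chosen, rejected); such pairs would train a preference model.
    """
    verdict = evaluator(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\nAnswer with A or B."
    )
    chosen_is_a = verdict.strip().upper().startswith("A")
    return (a, b) if chosen_is_a else (b, a)


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a language model so the sketch runs offline."""
    if "Answer with A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique"
    return "draft response"


if __name__ == "__main__":
    print(critique_and_revise(toy_model, "example prompt"))
    print(ai_preference(toy_model, "example prompt", "x", "y", CONSTITUTION[0]))
```

Keeping the principles in a plain list is what makes the value specification easy to inspect and edit: changing the constitution changes the critiques and the preference labels without touching the rest of the pipeline.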

Key findings

Models trained with Constitutional AI were at least as harmless as RLHF baselines while using no human labels for harmlessness (human labels were still used for helpfulness).
Made the value system inspectable and editable.
Reduced evasive non-answers compared to early RLHF systems: the model explains its objection to a harmful request rather than simply refusing.

Strengths

Scales harmlessness training without proportional human-label cost.
Transparent value specification.

Limitations

Quality bounded by the critiquing model's own judgment.
Constitution authoring is itself a values-laden choice.

Practical implications

Foundational technique behind Claude.
Influential template for explicit, document-driven alignment.

Related research

Training Language Models to Follow Instructions with Human Feedback

Introduced InstructGPT and the now-standard RLHF pipeline for aligning LLMs with human intent.

Concrete Problems in AI Safety

Foundational taxonomy of practical safety problems in modern ML systems.
