
AI Safety & Alignment

Constitutional AI: Harmlessness from AI Feedback

Bai et al. · 2022 · Anthropic

Trained a helpful, harmless assistant using AI-generated critiques guided by a written constitution.

Research objective

Reduce reliance on human labels for harmlessness training, and make model values explicit and auditable.

Methodology

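Supervised stage: a helpful-only RLHF model critiques and revises its own responses to harm-eliciting prompts against a list of natural-language principles ('the constitution'), and a pretrained model is then fine-tuned on the revised responses. RL stage: an AI evaluator compares pairs of sampled responses against the constitution, and these AI-generated preferences, rather than human harmlessness labels, train the preference model used as the RL reward signal (RL from AI Feedback, RLAIF).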
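As a rough illustration of the two steps described in this section, the sketch below shows a critique-and-revise loop followed by an AI preference comparison. It is a minimal sketch under stated assumptions, not the paper's implementation: `toy_model`, the prompt templates, the two principles, and the hard A/B verdict are all hypothetical stand-ins, and no model is actually trained here.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical principles for illustration; not the paper's actual constitution.
CONSTITUTION: List[str] = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response is evasive instead of explaining an objection.",
]

# Assumption: a "model" is any function mapping prompt text to response text.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rounds: int = 2, seed: int = 0) -> str:
    """Draft a response, then repeatedly critique and revise it."""
    rng = random.Random(seed)
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = model(
            f"CRITIQUE\nPrompt: {prompt}\nResponse: {response}\n"
            f"Principle: {principle}"
        )
        response = model(
            f"REVISE\nPrompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}"
        )
    return response  # the revised text would become a fine-tuning target


def ai_preference(evaluator: Model, prompt: str, a: str, b: str,
                  principle: str) -> Tuple[str, str]:
    """Ask an AI evaluator which response better follows the principle.

    Returns (chosen, rejected). A hard A/B verdict is a simplification
    made for this sketch.
    """
    verdict = evaluator(
        f"JUDGE\nPrinciple: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\nAnswer A or B."
    )
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)


def toy_model(text: str) -> str:
    """Deterministic stand-in for a language model (purely illustrative)."""
    if text.startswith("JUDGE"):
        option_a = text.split("(A) ")[1].split("\n")[0]
        # Toy rule: prefer whichever option has been revised.
        return "A" if "[revised]" in option_a else "B"
    if text.startswith("CRITIQUE"):
        return "The draft does not follow the principle."
    if text.startswith("REVISE"):
        return "[revised] I can't help with that, and here is why."
    return "[draft] Sure, here is how."


question = "How do I pick a lock?"
draft = toy_model(question)
revised = critique_and_revise(toy_model, question, CONSTITUTION)
chosen, rejected = ai_preference(toy_model, question, draft, revised,
                                 CONSTITUTION[0])
print("chosen:", chosen)
print("rejected:", rejected)
```

In a real pipeline the revised responses would feed supervised fine-tuning, and the (chosen, rejected) pairs would train a preference model whose score is used as the RL reward. Both steps are omitted from this sketch.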

Key findings

  • Models trained with Constitutional AI were at least as harmless as RLHF models while using no human labels for harmlessness (human feedback was still used for helpfulness).
  • Made the value system inspectable and editable.
  • Reduced evasive non-answers compared to early RLHF systems.

Strengths

  • Scales harmlessness training without proportional human-label cost.
  • Transparent value specification.

Limitations

  • Quality bounded by the critiquing model's own judgment.
  • Constitution authoring is itself a values-laden choice.

Practical implications

  • Foundational technique behind Claude.
  • Influential template for explicit, document-driven alignment.
