constitutional-ai

Featured

Anthropic's method for training harmless AI through self-improvement. A two-phase approach: supervised learning with self-critique and revision, then RLAIF (RL from AI Feedback). Use for safety alignment and for reducing harmful outputs without human labels for harmful content. Powers Claude's safety system.

AI & Automation | 27,705 stars | 2,858 forks | Updated today | MIT

Quality Score: 99/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Constitutional AI - Harmlessness from AI Feedback

## Quick start

Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.

**Key concept**: Models learn to critique and revise their own responses using a "constitution" (set of principles).

**Two phases**:
1. **Supervised Learning (SL)**: Self-critique + revision
2. **Reinforcement Learning (RL)**: RLAIF (RL from AI Feedback)

**Constitution example**:
```
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
```

## Common workflows

### Workflow 1: Supervised learning phase (self-critique + revision)

**Step 1: Generate initial responses**:
```python
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

initial_responses = generator(prompts, max_length=200)
```

**Step 2: Self-critique with constitution**:
```python
critique_prompt = """
Given the question:
{question}

And the response:
{response}

Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse

Identify any ways this respo...
```
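The self-critique and revision loop of the supervised phase can be sketched roughly as follows. This is a minimal illustration, not the skill's own code: `format_principles`, `critique_and_revise`, the `generate` callable, and the wording of both prompts are hypothetical, and the principles are copied from the constitution example above. Any text-generation function that maps a prompt string to a completion string could be passed in as `generate`.

```python
from typing import Callable, Dict, List

# Principles copied from the constitution example in the skill content.
CONSTITUTION: List[str] = [
    "Choose the response that is most helpful, honest, and harmless",
    "Avoid responses that are toxic, racist, or sexist",
    "Prefer responses that explain objections rather than refuse",
    "Choose responses that are thoughtful and nuanced",
]


def format_principles(principles: List[str]) -> str:
    """Render the principles as a numbered list for inclusion in a prompt."""
    return "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))


def critique_and_revise(
    question: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, completion out
    principles: List[str],
    rounds: int = 1,
) -> Dict[str, object]:
    """Draft a response, then run `rounds` critique/revision passes.

    Returns the initial draft, every critique, and the final revision.
    The (question, final) pair is the kind of example the supervised
    phase would fine-tune on.
    """
    response = generate(question)
    initial = response
    critiques: List[str] = []

    for _ in range(rounds):
        # Hypothetical prompt wording; the original skill's prompt is truncated.
        critique = generate(
            "Critique the response below against the constitution.\n"
            f"Question: {question}\n"
            f"Response: {response}\n"
            f"Constitution:\n{format_principles(principles)}"
        )
        critiques.append(critique)

        # Ask the same model to rewrite its answer in light of the critique.
        response = generate(
            "Revise the response so that it addresses the critique.\n"
            f"Question: {question}\n"
            f"Response: {response}\n"
            f"Critique: {critique}"
        )

    return {"initial": initial, "critiques": critiques, "final": response}
```

Passing the model in as a plain callable keeps the loop independent of any particular inference library, so the same function could wrap a `transformers` pipeline, an API client, or a stub used in tests.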

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

constitutional-ai

Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.

9,182 Updated 1 month ago
Orchestra-Research
AI & Automation Solid

constitutional-ai-prompts

Constitutional AI and safety guardrail prompts for aligned LLM behavior

1,160 Updated today
a5c-ai
Web & Frontend Listed

constitutional-reasoning

Self-critique and Constitutional AI reasoning skill. Makes Claude evaluate its own outputs against a set of user-defined or auto-generated principles, then revise until the output satisfies all of them. Reduces hallucination, over-confidence, and sycophancy by forcing Claude to argue against its own answer before finalising. Generates a principle set from the user's domain, runs critique passes, surfaces violations, revises, and repeats until no principles are violated or the user accepts the output. Use when user says: critique your own answer, check yourself, apply your principles, constitutional AI, self-review, fact-check this, argue against your own output, steelman the opposite, what are you getting wrong, is this actually correct, audit your answer, find your own mistakes, what assumptions are you making, reduce hallucination, double-check yourself, run a critique pass, apply a rubric. Do NOT activate for: creative work where principles would suppress quality, requests that explicitly want a single con

2 Updated 6 days ago
Sandeeprdy1729
AI & Automation Listed

ai-constitution

Interviews the operator to produce a project-identity CONSTITUTION.md (Mission / Stakeholders / Vocabulary / Prohibitions / Compliance gates / Anti-goals / Boundaries / Escalation / Language / Lifecycle phase). Trigger for 'set up the constitution', 'define project identity', 'who is this project for', 'what does this project never do', 'amend the constitution'. Not for AI-behaviour rules — those live in CANONICAL.md / AGENTS.md. Not for spec governance; use /ai-governance instead.

49 Updated today
arcasilesgroup
AI & Automation Listed

ai-safety-guardrails

Design safety experiences for AI products - content moderation UX, bias detection surfaces, harm prevention patterns, and responsible AI interfaces. Use when: AI safety UX, content moderation, responsible AI, AI bias UX, harm prevention, content filtering UX, AI refusal design, safety disclaimers.

1 Updated today
varunk130