content-moderation-patterns

Content moderation with Claude: pre-filter vs LLM-classify, categories, thresholds, HITL. Triggers: moderation, safety filter, policy enforcement, content classifier.

AI & Automation 161 stars 21 forks Updated yesterday Apache-2.0

Quality Score: 93/100

Skill Content

# Content Moderation Patterns

Two-stage pattern that balances cost, latency, and quality: cheap deterministic filters first, then LLM classification only on survivors.

## Architecture

```
[ input ]
    │
    ▼
[ pre-filter ] ── (regex, allow/block lists, length check) ──► reject early
    │
    ▼
[ LLM classifier ] ── (Haiku, structured output) ──► categories + confidence
    │
    ▼
[ decision router ]
    ├── high confidence + policy violation → reject
    ├── high confidence + clean → pass
    └── low confidence or edge categories → human review queue
```

## Pre-filter Stage (cheap)

Catch the obvious cases before paying an LLM call:

```python
import re

BANNED_PATTERNS = [
    re.compile(r"\b(banned_term_1|banned_term_2)\b", re.I),
    re.compile(r"\bhttps?://(?!allowed-domain\.com)", re.I),  # external links
]

def pre_filter(text: str) -> tuple[bool, str]:
    # Returns (passed, reason); reason names the rule that rejected the text.
    if len(text) > 10_000:
        return False, "too_long"
    for pat in BANNED_PATTERNS:
        if pat.search(text):
            return False, f"banned_pattern:{pat.pattern}"
    return True, "pass"
```

Roughly 40-70% of spammy input should die here. Log counts by rule so you can tune.

## LLM Classifier Stage (Haiku)

Use the smallest capable model. Haiku is usually right for moderation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CATEGORIES = ["harassment", "self_harm", "spam", "off_topic", "pii", "clean"]

def classify(text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        tools=[{ ...
```
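The decision router from the architecture diagram is not shown in this excerpt. A minimal sketch of its three branches follows, under stated assumptions: the classifier result is a dict with hypothetical keys `category` and `confidence`, the 0.85 threshold is an illustrative value rather than one from the source, and `self_harm` is picked here as an example of an "edge category" that always goes to a human.

```python
from typing import Literal

Decision = Literal["reject", "pass", "human_review"]

# Assumed values for illustration only; tune against labeled data.
CONFIDENCE_THRESHOLD = 0.85
EDGE_CATEGORIES = {"self_harm"}  # hypothetical: always escalate to a human


def route(result: dict) -> Decision:
    """Map a classifier result to one of the three router outcomes."""
    category = result.get("category")
    confidence = result.get("confidence", 0.0)

    # Missing fields, low confidence, or an edge category: human review queue.
    if (
        category is None
        or confidence < CONFIDENCE_THRESHOLD
        or category in EDGE_CATEGORIES
    ):
        return "human_review"
    # High confidence and clean: pass.
    if category == "clean":
        return "pass"
    # High confidence and a policy violation: reject.
    return "reject"
```

Checking the review conditions first means a malformed or uncertain classifier response never results in an automatic pass or reject.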

Details

Author: softspark
Repository: softspark/ai-toolkit
Created: 4 months ago
Last Updated: yesterday
Language: Python
License: Apache-2.0

Integrates with

Anthropic · AI
