content-moderation-patterns

Solid

Content moderation with Claude: pre-filter vs LLM-classify, categories, thresholds, HITL. Triggers: moderation, safety filter, policy enforcement, content classifier.

AI & Automation 155 stars 19 forks Updated 2 days ago MIT

Quality Score: 93/100

Stars (20%): 73
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 80
License (10%): 100
Description (5%): 100

Skill Content

# Content Moderation Patterns

Two-stage pattern that balances cost, latency, and quality: cheap deterministic filters first, then LLM classification only on survivors.

## Architecture

```
[ input ]
    │
    ▼
[ pre-filter ] ── (regex, allow/block lists, length check) ──► reject early
    │
    ▼
[ LLM classifier ] ── (Haiku, structured output) ──► categories + confidence
    │
    ▼
[ decision router ]
    ├── high confidence + policy violation → reject
    ├── high confidence + clean → pass
    └── low confidence or edge categories → human review queue
```

## Pre-filter Stage (cheap)

Catch the obvious cases before paying for an LLM call:

```python
import re

BANNED_PATTERNS = [
    re.compile(r"\b(banned_term_1|banned_term_2)\b", re.I),
    re.compile(r"\bhttps?://(?!allowed-domain\.com)", re.I),  # external links
]

def pre_filter(text: str) -> tuple[bool, str]:
    """Return (passed, reason). `passed` is False when the input is rejected."""
    # Reject oversized input before running any regex.
    if len(text) > 10_000:
        return False, "too_long"
    # The first matching pattern rejects; the reason names the rule for logging.
    for pat in BANNED_PATTERNS:
        if pat.search(text):
            return False, f"banned_pattern:{pat.pattern}"
    return True, "pass"
```

Roughly 40-70% of spammy input should die here. Log rejection counts by rule so you can tune the patterns.

## LLM Classifier Stage (Haiku)

Use the smallest capable model. Haiku is usually the right choice for moderation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CATEGORIES = ["harassment", "self_harm", "spam", "off_topic", "pii", "clean"]

def classify(text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        tools=[{ ...
```
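The decision router from the architecture diagram could be sketched roughly as follows. This is a minimal illustration, not the skill's own implementation: the `Classification` dataclass, the `route` function, the `CONFIDENCE_THRESHOLD` value of 0.85, and the choice of `self_harm` as an edge category are all assumptions made for the example, and the classifier is assumed to yield one category with a confidence between 0 and 1.

```python
from dataclasses import dataclass

# Hypothetical settings: tune both against labeled moderation data.
CONFIDENCE_THRESHOLD = 0.85
EDGE_CATEGORIES = {"self_harm"}  # assumption: always escalated to a human


@dataclass
class Classification:
    """Assumed shape of the classifier output: one category plus a confidence."""
    category: str
    confidence: float


def route(result: Classification) -> str:
    """Map a classification to 'reject', 'pass', or 'review'."""
    # Edge categories go to the human review queue regardless of confidence.
    if result.category in EDGE_CATEGORIES:
        return "review"
    # Low confidence also goes to the human review queue.
    if result.confidence < CONFIDENCE_THRESHOLD:
        return "review"
    # High confidence and clean: let the content through.
    if result.category == "clean":
        return "pass"
    # High confidence and a policy-violating category: reject.
    return "reject"
```

Checking edge categories before the confidence threshold means a confident `self_harm` label still reaches a person, which matches the "low confidence or edge categories" branch of the diagram.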

Details

Author: softspark
Repository: softspark/ai-toolkit
Created: 2 months ago
Last Updated: 2 days ago
Language: Python
License: MIT