prompt-guard

Solid

Meta's 86M-parameter prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2 ms on GPU). Multilingual (8 languages). Deploy with HuggingFace, or use batch processing for RAG security.

AI & Automation · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

- Stars (20%): 100
- Recency (20%)
- Frontmatter (20%)
- Documentation (15%): 100
- Issue Health (10%)
- License (10%): 100
- Description (5%): 100

Skill Content

# Prompt Guard - Prompt Injection & Jailbreak Detection

Prompt Guard is an 86M parameter classifier that detects prompt injections and jailbreak attempts in LLM applications.

## Quick start

**Installation**:

```bash
pip install transformers torch
```

**Basic usage**:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.nn.functional import softmax

model_id = "meta-llama/Prompt-Guard-86M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

def get_jailbreak_score(text):
    """Check user input for jailbreak attempts."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = softmax(logits, dim=-1)
    return probs[0, 2].item()  # Jailbreak probability

# Check prompt
score = get_jailbreak_score("Ignore previous instructions")
if score > 0.5:
    print("⚠️ Jailbreak attempt detected!")
```

**Classification labels**:

- **BENIGN** (label 0): Normal content
- **INJECTION** (label 1): Embedded instructions in data
- **JAILBREAK** (label 2): Direct override attempts

## Common workflows

### Workflow 1: User input filtering (jailbreak detection)

**Filter user prompts before LLM**:

```python
def filter_user_input(user_message, threshold=0.5):
    """
    Filter user input for jailbreak attempts.
    Returns: (is_safe, score, message)
    ...
```
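
To make the three classification labels concrete, here is a minimal sketch of how a single row of classifier logits could be turned into scores and flags. It uses only the standard library, so it runs without downloading the model. The helper name `interpret_logits`, the returned dictionary keys, and the choice to sum the INJECTION and JAILBREAK probabilities when screening third-party data are illustrative assumptions, not part of the skill's own code.

```python
import math

# Label ids as listed under "Classification labels".
LABELS = ("BENIGN", "INJECTION", "JAILBREAK")


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret_logits(logits, threshold=0.5):
    """Hypothetical helper: map one row of 3-class logits to scores and flags."""
    probs = softmax(logits)
    jailbreak_score = probs[2]
    # Assumption: for third-party data, embedded instructions and
    # jailbreaks are both unwanted, so their probabilities are summed.
    injection_score = probs[1] + probs[2]
    top_index = max(range(len(LABELS)), key=lambda i: probs[i])
    return {
        "label": LABELS[top_index],
        "jailbreak_score": jailbreak_score,
        "injection_score": injection_score,
        "flag_user_input": jailbreak_score > threshold,
        "flag_third_party": injection_score > threshold,
    }


if __name__ == "__main__":
    # Made-up logits for illustration only, not real model output.
    print(interpret_logits([0.2, 3.1, 0.4]))
```

In a real pipeline the logits would come from `model(**inputs).logits[0].tolist()` as in the basic usage snippet, and the threshold would be tuned on representative traffic rather than left at 0.5.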

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

prompt-guard

Runtime security scanner for AI agents. Detects prompt injection, jailbreaks, and 600+ attack patterns offline.

0 Updated today

fathanghani864

AI & Automation Solid

prompt-injection-detector

Prompt injection detection and prevention for secure LLM applications

1,160 Updated today

a5c-ai

AI & Automation Featured

detecting-ai-model-prompt-injection-attacks

Detects prompt injection attacks targeting LLM-based applications using a multi-layered defense combining regex pattern matching for known attack signatures, heuristic scoring for structural anomalies, and transformer-based classification with DeBERTa models. The detector analyzes user inputs before they reach the LLM, flagging direct injections (system prompt overrides, role-play escapes, instruction hijacking) and indirect injections (encoded payloads, multi-language obfuscation, delimiter-based escapes). Based on the OWASP LLM Top 10 (LLM01:2025 Prompt Injection) and Simon Willison's prompt injection taxonomy. Activates for requests involving prompt injection detection, LLM input sanitization, AI security scanning, or prompt attack classification.

13,115 Updated today

mukul975

AI & Automation Featured

llamaguard

Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories - violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, Sagemaker. Integrates with NeMo Guardrails.

27,705 Updated today

davila7

AI & Automation Solid

llamaguard

9,182 Updated 1 month ago

Orchestra-Research