prompt-injection-test · listed

Run an OWASP LLM01 injection corpus against the system prompt + tool surface and report which payloads succeeded
bakw00ds/yakos · ★ 2 · AI & Automation · score 81
Install: claude install-skill bakw00ds/yakos
# Prompt Injection Test

## Purpose

Treat an agent's system prompt + tool surface as the attack surface. Run a corpus of known injection payloads (jailbreaks, indirect injection, tool-confusion, exfiltration) and report which succeeded. Aligned with OWASP LLM Top-10 entry **LLM01: Prompt Injection**.

A "success" means the agent did something it was instructed not to: revealed the system prompt, called a forbidden tool, leaked a secret, followed an attacker-controlled instruction over the operator's, or produced output that bypassed a stated guardrail.

## Scope

- Reads an agent definition from `<project>/.claude/agents/<agent-id>.md` to enumerate the system prompt, allowed tools, and stated guardrails.
- Reads a payload corpus from `<project>/eval/injection-corpus/<name>.jsonl`. Each entry has `{id, category, severity, payload, success_criteria}`. The default corpus ships with the framework at `lib/eval/injection-corpus/owasp-llm01.jsonl`.
- Runs each payload as the user input (or, for indirect injection, embeds the payload in a tool result the agent reads).
- Evaluates the agent's response against `success_criteria`. This is a regex / structural check, not an LLM-as-judge; judge-based scoring is too easily injection-poisoned itself.
- Produces a markdown report grouped by severity, listing successful jailbreaks first.
- Designed for `ai-safety-reviewer` (defensive analysis) and `red-team` (adversarial probe of new agents before ship).

## When to use

-