Install: claude install-skill yeaight7/agent-powerups
# Prompt Evaluation Runner
## When to use
Use when you need to evaluate an LLM app, test a prompt systematically, or run red-team/vulnerability scans against a target model or application.
## Requirements / Checks
1. Check whether an evaluation tool is already declared in the project's dependencies, scripts, lockfiles, or local toolchain (e.g., `promptfoo`, `evals`, `braintrust`).
2. Do not run unvetted remote runners without checking the project's toolchain first (e.g., avoid `npx promptfoo@latest` if `promptfoo` is already installed locally).
3. If no runner exists, ask before adding a dev dependency or using an ephemeral runner.
4. Confirm expected cost, provider, API keys, and network target before any execution.
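
The first check can be partly automated. The following is a minimal sketch, assuming the project declares its tools in `package.json` or `requirements.txt`; the helper name `find_declared_runners` and the `KNOWN_RUNNERS` tuple are hypothetical and are not part of any runner's API. It does not inspect lockfiles or scripts, so treat an empty result as "ask the user", not as proof that no runner exists.

```python
import json
import re
from pathlib import Path

# Runner names taken from the examples in the checklist; extend as needed.
KNOWN_RUNNERS = ("promptfoo", "evals", "braintrust")


def find_declared_runners(project_dir: str) -> dict[str, list[str]]:
    """Map each known runner to the manifest files that declare it."""
    root = Path(project_dir)
    found: dict[str, list[str]] = {}

    # Node projects: look in dependencies and devDependencies.
    package_json = root / "package.json"
    if package_json.is_file():
        manifest = json.loads(package_json.read_text(encoding="utf-8"))
        declared: set[str] = set()
        for section in ("dependencies", "devDependencies"):
            declared.update(manifest.get(section, {}))
        for runner in KNOWN_RUNNERS:
            if runner in declared:
                found.setdefault(runner, []).append("package.json")

    # Python projects: take the requirement name at the start of each line.
    requirements = root / "requirements.txt"
    if requirements.is_file():
        for line in requirements.read_text(encoding="utf-8").splitlines():
            name = re.split(r"[\s=<>!~\[;]", line.strip(), maxsplit=1)[0].lower()
            if name in KNOWN_RUNNERS:
                found.setdefault(name, []).append("requirements.txt")

    return found


if __name__ == "__main__":
    runners = find_declared_runners(".")
    if runners:
        print("Declared runners:", runners)
    else:
        print("No known runner declared; ask before adding one.")
```

If a runner is found, prefer the locally declared version over an ephemeral remote one, as the second check requires.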
## Workflow
1. **Define risk** — state target behavior, failure mode, provider(s), and budget limits before writing any config.
2. **Choose assertions** — prefer deterministic checks first:
| Assertion type | When to use |
|---|---|
| `contains` / `not-contains` | Output must include/exclude specific text |
| `regex` | Structured output pattern (e.g., JSON key present) |
| `json-schema` | Output must conform to a schema |
| `cost` | Must stay under a token/dollar budget |
| `latency` | Must respond within N ms |
| `javascript` / `python` | Custom logic when simpler types don't fit |
| Model grader | Last resort — only for subjective quality checks |
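
To make the preference for deterministic checks concrete, here is a rough sketch of what several of the assertion types in the table amount to when written as plain Python. It is illustrative only and is not the implementation or API of any particular runner; every function name and the sample output string are hypothetical, and the JSON check only tests for required keys, not full schema conformance.

```python
import json
import re


def check_contains(output: str, needle: str) -> bool:
    """`contains`: the output must include the given text."""
    return needle in output


def check_not_contains(output: str, needle: str) -> bool:
    """`not-contains`: the output must exclude the given text."""
    return needle not in output


def check_regex(output: str, pattern: str) -> bool:
    """`regex`: the output must match a structural pattern somewhere."""
    return re.search(pattern, output) is not None


def check_json_keys(output: str, required_keys: list[str]) -> bool:
    """Simplified stand-in for `json-schema`: valid JSON object with required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def check_latency(elapsed_ms: float, max_ms: float) -> bool:
    """`latency`: the response must arrive within the allowed time."""
    return elapsed_ms <= max_ms


def run_checks(output: str, elapsed_ms: float) -> dict[str, bool]:
    """Run every deterministic check and report pass/fail per assertion."""
    return {
        "contains": check_contains(output, "Paris"),
        "not-contains": check_not_contains(output, "As an AI"),
        "regex": check_regex(output, r'"answer"\s*:'),
        "json-keys": check_json_keys(output, ["answer", "confidence"]),
        "latency": check_latency(elapsed_ms, max_ms=2000),
    }


if __name__ == "__main__":
    sample_output = '{"answer": "Paris", "confidence": 0.9}'  # hypothetical model output
    for name, passed in run_checks(sample_output, elapsed_ms=850).items():
        print(f"{name}: {'pass' if passed else 'fail'}")
```

Each of these checks gives the same verdict on every run and costs nothing to evaluate, which is why they come before a model grader in the table.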
3. **Use model graders sparingly** — pin the grader model and provider explicitly; document the cost and no