Install: claude install-skill christim427-rgb/ios-agent-skills
# Skill Benchmarking
Strict, agent-agnostic benchmark runner for `evals.json` skill evaluation. Produces `benchmark-<model>.json` with pass rates and a discriminating assertion list. Only assertions that actually discriminate between with-skill and without-skill responses are kept; non-discriminating noise is removed via the assertion hygiene process.
This skill works with **any AI coding assistant** -- Claude Code, Gemini CLI, GitHub Copilot, Cursor, Windsurf, or any agent that can read files and run shell commands.
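As a rough illustration of what a "discriminating" assertion means here, the sketch below compares per-assertion outcomes from a with-skill run and a without-skill run and keeps only the assertions whose outcome differs. The function names, the `dict` shape of the results, and the exact definition of "discriminating" are assumptions made for illustration; they are not the actual `evals.json` or `benchmark-<model>.json` schema, and the real hygiene process may apply stricter rules.

```python
from typing import Dict, List


def discriminating_assertions(
    with_skill: Dict[str, bool],
    without_skill: Dict[str, bool],
) -> List[str]:
    """Return ids of assertions whose outcome differs between the two runs.

    Hypothetical input shape: assertion id -> True if the assertion passed.
    Assertions that appear in only one run are ignored.
    """
    shared = with_skill.keys() & without_skill.keys()
    return sorted(a for a in shared if with_skill[a] != without_skill[a])


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of assertions that passed; 0.0 for an empty result set."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical results for three assertions.
with_skill = {"a1": True, "a2": True, "a3": False}
without_skill = {"a1": True, "a2": False, "a3": False}

kept = discriminating_assertions(with_skill, without_skill)
```

In this example `a1` passes in both runs and `a3` fails in both, so neither tells the two runs apart; only `a2` would survive the filter.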
---
## Quick Start for Non-Claude Agents
If you are using **Gemini CLI**, **GitHub Copilot**, **Cursor**, or another AI coding assistant:
1. **Read this file** (`scripts/benchmarking/SKILL.md`) -- it is the complete workflow guide
2. **Follow the phases below** in order. Each phase tells you exactly what to do
3. **Run Python scripts** via your terminal or shell tool. All scripts use only the Python standard library (no pip installs needed)
4. **For grading** (Phase 3), you MUST use a separate, fresh context that has NOT read the skill being tested. If your agent supports subagents, use one. Otherwise, start a new chat session for grading
5. **File paths** in this guide are relative to the repository root. Adjust if your working directory differs
### Key differences from Claude Code usage
| Claude Code feature | Equivalent for other agents |
|---|---|
| `Explore` subagent | Start a fresh chat session, or use your agent's subprocess/