skill-benchmarking

Run skill benchmarks with discriminating-only assertions against evals.json for any model and any AI agent. Use when benchmarking a skill against a model not yet tested, running with_skill/without_skill eval pairs, producing benchmark-<model>.json, re-grading an existing run, adding Phase 2 model comparison results, reviewing results in the eval viewer, updating README benchmark tables, or cleaning non-discriminating assertions from evals.json. Enforces strict grader isolation (the context that generates responses never grades them) and evidence-only passing (assertions pass only on explicit content, never on implication or charity). Works with Claude Code, Gemini CLI, GitHub Copilot, Cursor, and any other AI coding assistant.
christim427-rgb/ios-agent-skills · ★ 1 · AI & Automation · score 74

Install: claude install-skill christim427-rgb/ios-agent-skills

# Skill Benchmarking

Strict, agent-agnostic benchmark runner for `evals.json` skill evaluation. Produces `benchmark-<model>.json` with pass rates and a discriminating assertion list. Only assertions that actually discriminate between with-skill and without-skill responses are kept; non-discriminating noise is removed via the assertion hygiene process.

This skill works with **any AI coding assistant** -- Claude Code, Gemini CLI, GitHub Copilot, Cursor, Windsurf, or any agent that can read files and run shell commands.

---

## Quick Start for Non-Claude Agents

If you are using **Gemini CLI**, **GitHub Copilot**, **Cursor**, or another AI coding assistant:

1. **Read this file** (`scripts/benchmarking/SKILL.md`) -- it is the complete workflow guide
2. **Follow the phases below** in order. Each phase tells you exactly what to do
3. **Run Python scripts** via your terminal or shell tool. All scripts use only the Python standard library (no pip installs needed)
4. **For grading** (Phase 3), you MUST use a separate/fresh context that has NOT read the skill being tested. If your agent supports subagents or separate chat sessions, use that. If not, start a new chat session for grading
5. **File paths** in this guide are relative to the repository root. Adjust if your working directory differs

### Key differences from Claude Code usage

| Claude Code feature | Equivalent for other agents |
|---|---|
| `Explore` subagent | Start a fresh chat session, or use your agent's subprocess/