skill-builderlisted

Create new skills, modify and improve existing skills, and measure skill performance with **unbiased blind evaluations**. Drop-in replacement for `skill-creator` that fixes its central design flaw — the same agent writing the skill also writing the tests, which lets implementation knowledge contaminate prompt selection. `skill-builder` delegates eval authoring to a blind subagent that sees only the skill's name + description, never the SKILL.md body, scripts, or fixtures. Use whenever the user wants to create, modify, or benchmark a skill, or optimize a skill's description for triggering accuracy.
Stoica-Mihai/claude-skills · ★ 0 · AI & Automation · score 70

Install: claude install-skill Stoica-Mihai/claude-skills

# Skill Builder A skill for creating new skills and iteratively improving them, with one deliberate departure from the upstream `skill-creator` workflow: **eval prompts and assertions are authored by a blind subagent**, never by the same agent that designed the skill. See [Test Cases](#test-cases) for mechanics. This skill is derived from Anthropic's `skill-creator` under Apache 2.0. Plumbing for the eval viewer, benchmark aggregation, description optimizer, and packaging is unchanged. Only the eval-authoring step is replaced. At a high level, the process of creating a skill goes like this: - Decide what you want the skill to do and roughly how it should do it - Write a draft of the skill - Create a few test prompts and run claude-with-access-to-the-skill on them - Help the user evaluate the results both qualitatively and quantitatively - While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use as is or modify if you feel something needs to change about them). Then explain them to the user (or if they already existed, explain the ones that already exist) - Use the `eval-viewer/generate_review.py` script to show the user the results for them to look at, and also let them look at the quantitative metrics - Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks) - Repeat until you're