tool-use-eval

Evaluation pattern for agents that call tools / functions - measures tool selection accuracy, argument correctness, recovery from tool errors, and avoidance of unnecessary tool calls. Pairs with eval-suite-planner. Use when: tool-use eval, function-calling eval, agent tool selection, tool invocation accuracy, tool argument validation, tool error recovery, agent tools, function-calling agent, MCP eval.
varunk130/AI-Eval-Skills · ★ 1 · AI & Automation · score 72
Install: claude install-skill varunk130/AI-Eval-Skills
# Tool-Use Eval

Evaluation pattern for agents whose job is to invoke tools / functions / MCP capabilities. Conversational quality and task-completion evals miss the failure modes specific to tool-use; this skill adds the four dimensions that actually predict whether a tool-using agent is production-ready.

## Core Principle

**A tool-using agent has four ways to fail that pure-conversation agents don't.** Generic quality scores can be high while every one of these is broken. This eval pattern separates them so improvements target the right failure mode.

## The Four Tool-Use Dimensions

| Dimension | What It Measures | Failure Example |
|-----------|------------------|-----------------|
| **Tool Selection Accuracy** | Did the agent pick the right tool for the request? | Used `search_web` when `lookup_internal_kb` was the right call |
| **Argument Correctness** | Were the arguments well-formed and schema-valid? | Required field missing, type mismatch, hallucinated parameter |
| **Error Recovery** | When a tool returned an error, did the agent recover sensibly? | Repeats the same failing call, or gives up silently |
| **Restraint** | Did the agent avoid unnecessary tool calls? | Calls 5 tools when 1 would do; calls a tool when no tool was needed |

The four dimensions are **independently** scorable - improvements often help one and hurt another (e.g., tightening argument validation can increase Restraint failures).

## Scoring Anchors (0-3 per dimension)

### Tool Selection Ac