agent-evaluation

Evaluate LLM agents and tool-using workflows—task success, tool accuracy, latency/cost, safety, and regression suites. Use when shipping agent features, comparing prompts/models, or debugging agent failures.

AI & Automation 22 stars 8 forks Updated 6 days ago MIT

Skill Content

# Agent evaluation

## What to measure

| Dimension | Examples |
|-----------|----------|
| Task success | End state matches spec (binary or rubric) |
| Tool use | Correct tool, valid args, no spurious calls |
| Safety | No policy violations, no secret leakage |
| Efficiency | Tokens, latency, tool call count |
| Stability | Same input -> consistent outcome across runs |

## Workflow

1. **Define tasks**: realistic user intents with clear pass/fail or scored rubric.
2. **Build dataset**: golden set + edge cases (errors, ambiguous input, empty context).
3. **Run baseline**: fixed model/settings; log traces (inputs, tools, outputs).
4. **Score**: automated checks first; human review for ambiguous cases.
5. **Compare**: A/B prompts, models, or tool schemas; report deltas with confidence notes.
6. **Gate**: block release on regression in must-pass tasks.

## Automated checks

- Schema validation on tool arguments.
- Assert final answer contains required fields or avoids forbidden content.
- Snapshot tests for deterministic sub-steps where possible.

## Human rubric (when needed)

Score 1-5 on: correctness, completeness, tone, safety. Document disagreements.

## Anti-patterns

- Eval only on cherry-picked happy paths.
- Changing task and model simultaneously without isolation.
- No trace logs when debugging tool failures.

## Output

Summary table: variant | success rate | avg tools | avg latency | notes.
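
The automated checks, the summary table, and the release gate described above can be sketched in a few functions. This is a minimal illustration, not the skill's own implementation: the `Trace` and `ToolCall` records, the `TOOL_SCHEMAS` mapping, the task ids, and the substring-based answer check are all assumptions made for the example. A real suite would more likely validate arguments against full JSON Schemas and load traces from logs.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical tool schemas: tool name -> {argument name: expected type}.
TOOL_SCHEMAS = {
    "search": {"query": str},
    "get_weather": {"city": str, "days": int},
}

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Trace:
    task_id: str
    variant: str
    tool_calls: list
    final_answer: str
    latency_s: float

def valid_tool_call(call):
    """True if the tool is known and its args match the schema's names and types."""
    schema = TOOL_SCHEMAS.get(call.name)
    if schema is None or set(call.args) != set(schema):
        return False
    return all(isinstance(call.args[k], t) for k, t in schema.items())

def passed(trace, required, forbidden):
    """Automated check: valid tool calls, required text present, forbidden text absent."""
    answer = trace.final_answer.lower()
    return (
        all(valid_tool_call(c) for c in trace.tool_calls)
        and all(r.lower() in answer for r in required)
        and not any(f.lower() in answer for f in forbidden)
    )

def summarize(traces, checks):
    """Per-variant metrics; checks maps task_id -> (required, forbidden)."""
    rows = {}
    for variant in sorted({t.variant for t in traces}):
        group = [t for t in traces if t.variant == variant]
        results = [passed(t, *checks[t.task_id]) for t in group]
        rows[variant] = {
            "success_rate": sum(results) / len(group),
            "avg_tools": mean(len(t.tool_calls) for t in group),
            "avg_latency_s": mean(t.latency_s for t in group),
        }
    return rows

def failed_must_pass(traces, checks, variant, must_pass):
    """Must-pass task ids the variant failed or never ran; empty list means no block."""
    runs = [t for t in traces if t.variant == variant and t.task_id in must_pass]
    failed = {t.task_id for t in runs if not passed(t, *checks[t.task_id])}
    missing = set(must_pass) - {t.task_id for t in runs}
    return sorted(failed | missing)

# Illustrative data only: two tasks, two variants.
checks = {"weather": (["paris"], ["api_key"]), "lookup": (["42"], [])}
traces = [
    Trace("weather", "baseline", [ToolCall("get_weather", {"city": "Paris", "days": 3})],
          "Paris will be sunny.", 2.0),
    Trace("lookup", "baseline", [ToolCall("search", {"query": "answer"})],
          "The answer is 42.", 1.0),
    # The candidate passes "days" as a string, so schema validation fails this run.
    Trace("weather", "candidate", [ToolCall("get_weather", {"city": "Paris", "days": "3"})],
          "Paris will be sunny.", 1.5),
    Trace("lookup", "candidate",
          [ToolCall("search", {"query": "answer"}), ToolCall("search", {"query": "42"})],
          "The answer is 42.", 0.5),
]

summary = summarize(traces, checks)
print("variant | success rate | avg tools | avg latency")
for variant, row in summary.items():
    print(f"{variant} | {row['success_rate']:.0%} | "
          f"{row['avg_tools']:.1f} | {row['avg_latency_s']:.2f}s")

blocked = failed_must_pass(traces, checks, "candidate", {"weather"})
print("release blocked on:", blocked)
```

Keeping `passed` as a pure function of a logged trace means the same traces can be re-scored when a check changes, without re-running the agent. The gate here treats a must-pass task with no recorded run as a failure, which is a deliberately conservative choice for the sketch.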

Details

Author: charlieviettq
Repository: charlieviettq/awesome-agent-skill
Created: 2 months ago
Last Updated: 6 days ago
Language: Python
License: MIT
