agent-benchmark

Solid

Framework for measuring and tracking agent response quality over time. Detects regressions before they reach production. Use when evaluating agent changes, auditing quality, or establishing performance baselines.

AI & Automation · 519 stars · 44 forks · Updated 1 week ago · MIT

Quality Score: 89/100

Stars: 20% weight
Recency: 20% weight
Frontmatter: 20% weight
Documentation: 15% weight, score 100
Issue Health: 10% weight
License: 10% weight, score 100
Description: 5% weight, score 100

Skill Content

# Agent Benchmark Framework

Without benchmarks, we cannot know whether agent changes improve or degrade quality. This skill defines how to measure, track, and protect agent performance.

## When to Activate

- Before and after modifying any agent definition file
- When adding a new skill that an agent depends on
- Periodic quality audits (weekly/monthly)
- When a user reports degraded agent output
- Before promoting an agent from experimental to production

## Core Concepts

### Why Benchmarks Matter

Agent quality degrades silently. A prompt tweak that improves one response can break ten others. Without a baseline to compare against, every change is a guess. Benchmarks make quality visible and regressions detectable.

### Benchmark Types

| Type | Scope | Cost | Frequency |
|------|-------|------|-----------|
| Prompt Benchmark | Single agent, single task | Low | Every agent change |
| Task Benchmark | End-to-end scenario | Medium | Feature changes |
| Regression Suite | All critical agents | High | Weekly / before release |

## Directory Structure

```
~/.claude/benchmarks/
  fixtures/
    code-reviewer/
      missing-error-handling.ts   # Input: code with no try/catch
      sql-injection.py            # Input: unparameterized query
      clean-code.ts               # Input: code with no issues
    security-reviewer/
      hardcoded-secret.ts         # Input: API key in source
      parameterized-query.py      # Input: safe query (no findings expected)
    v...
```
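The baseline comparison described under "Why Benchmarks Matter" could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the function name `find_regressions`, the `Regression` record, the 0-to-1 score scale, the 0.05 tolerance, and the sample scores are all hypothetical. Only the fixture names are taken from the directory listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    """One fixture whose score dropped relative to the stored baseline."""
    fixture: str
    baseline: float
    current: float


def find_regressions(baseline, current, tolerance=0.05):
    """Return fixtures whose current score fell more than `tolerance` below baseline.

    `baseline` and `current` map fixture names to scores in [0, 1]
    (a hypothetical scale chosen for this sketch).
    """
    regressions = []
    for fixture, base_score in sorted(baseline.items()):
        # Assumption: a fixture missing from the current run is scored 0.0,
        # so a silently dropped fixture shows up as a regression.
        new_score = current.get(fixture, 0.0)
        if base_score - new_score > tolerance:
            regressions.append(Regression(fixture, base_score, new_score))
    return regressions


if __name__ == "__main__":
    # Illustrative scores only; these are not measured results.
    baseline = {
        "code-reviewer/sql-injection.py": 0.9,
        "code-reviewer/clean-code.ts": 1.0,
        "security-reviewer/hardcoded-secret.ts": 0.8,
    }
    current = {
        "code-reviewer/sql-injection.py": 0.6,
        "code-reviewer/clean-code.ts": 1.0,
        "security-reviewer/hardcoded-secret.ts": 0.78,
    }
    for r in find_regressions(baseline, current):
        print(f"REGRESSION {r.fixture}: {r.baseline:.2f} -> {r.current:.2f}")
```

With these sample values only `code-reviewer/sql-injection.py` is flagged, because the small dip on `hardcoded-secret.ts` stays within the tolerance. A tolerance is used instead of a strict comparison because agent output tends to vary a little from run to run.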

Details

Author: vibeeval
Repository: vibeeval/vibecosystem
Created: 4 months ago
Last Updated: 1 week ago
Language: C#
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

agent-benchmark-suite

Agent skill for benchmark-suite - invoke with $agent-benchmark-suite

66,323 Updated today

ruvnet

Code & Development Solid

agent-benchmark

Self-benchmark: YOU write the code, adversarial reviews it (multi-provider), you fix, you write tests, adversarial reviews tests, you fix. Measures YOUR quality as an agent. Run in different models (Opus, Sonnet, Haiku) and compare results.

6 Updated today

greglas75

AI & Automation Listed

agent-review-benchmark

Generate evidence-linked Guided Review artifacts and deterministic local Agent benchmark summaries. Use when reviewing Agent-produced diffs by intent, recording structured follow-ups, comparing multiple Agent/model/configuration runs against versioned repository assertions, or preparing reproducible review and benchmark evidence for /ship or Mission validation.

1 Updated today

Kucell