evaluation-methodology

Solid

PluginEval quality methodology — dimensions, rubrics, statistical methods, and scoring formulas. Use this skill when understanding how plugin quality is measured, when interpreting a low score on a specific dimension, when deciding how to improve a skill's triggering accuracy or orchestration fitness, when calibrating scoring thresholds for your marketplace, or when explaining quality badges to external partners like Neon.

AI & Automation · 36,649 stars · 3,968 forks · Updated today · MIT

Quality Score: 93/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): not shown
Documentation (15%): 100
Issue Health (10%): not shown
License (10%): 100
Description (5%): 100

Skill Content

# Evaluation Methodology

This document is the authoritative reference for how PluginEval measures plugin and skill quality. It covers the three evaluation layers, all ten scoring dimensions, the composite formula, badge thresholds, anti-pattern flags, Elo ranking, and actionable improvement tips.

Related: [Full rubric anchors](references/rubrics.md)

---

## The Three Evaluation Layers

PluginEval stacks three complementary layers. Each layer produces a score between 0.0 and 1.0 for each applicable dimension, and later layers override or blend with earlier ones according to per-dimension blend weights.

### Layer 1 — Static Analysis

**Speed:** < 2 seconds. No LLM calls. Deterministic.

The static analyzer (`layers/static.py`) runs six sub-checks directly against the parsed SKILL.md:

| Sub-check | What it measures |
|---|---|
| `frontmatter_quality` | Name presence, description length, trigger-phrase quality |
| `orchestration_wiring` | Output/input documentation, code block count, orchestrator anti-pattern |
| `progressive_disclosure` | Line count vs. sweet-spot (200–600 lines), references/ and assets/ bonuses |
| `structural_completeness` | Heading density, code blocks, examples section, troubleshooting section |
| `token_efficiency` | MUST/NEVER/ALWAYS density, duplicate-line repetition ratio |
| `ecosystem_coherence` | Cross-references to other skills/agents, "related"/"see also" mentions |

These six sub-checks feed directly into six of the ten final dimensions (via `...
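As a rough illustration of the raw signals named in the Layer 1 table, the sketch below computes MUST/NEVER/ALWAYS density, a duplicate-line ratio, and a check against the 200–600 line sweet spot. It is not taken from `layers/static.py`: the function names, the per-word normalization, and the definition of "duplicate" (an exact repeat of an earlier non-blank line) are all assumptions made for this example, and it stops at raw signals rather than guessing how PluginEval turns them into 0.0–1.0 scores.

```python
import re
from collections import Counter

# Hypothetical helpers: names and normalization are assumptions, not PluginEval's API.
EMPHATIC = re.compile(r"\b(MUST|NEVER|ALWAYS)\b")  # case-sensitive, whole words only
SWEET_SPOT = (200, 600)  # line-count range quoted in the Layer 1 table


def emphatic_density(text: str) -> float:
    """Emphatic keywords per whitespace-separated word (0.0 for empty text)."""
    words = text.split()
    if not words:
        return 0.0
    return len(EMPHATIC.findall(text)) / len(words)


def duplicate_line_ratio(text: str) -> float:
    """Fraction of non-blank lines that exactly repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(lines)


def in_sweet_spot(text: str) -> bool:
    """True when the total line count falls inside the 200-600 range."""
    low, high = SWEET_SPOT
    return low <= len(text.splitlines()) <= high


if __name__ == "__main__":
    skill_md = "You MUST do this.\nYou MUST do this.\nNEVER skip tests.\n"
    print(emphatic_density(skill_md))
    print(duplicate_line_ratio(skill_md))
    print(in_sweet_spot(skill_md))
```

Keeping each signal in its own pure function makes the checks deterministic and cheap, which fits the "no LLM calls" description of this layer; any mapping from these signals to a dimension score would need the rubric in `references/rubrics.md`.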

Details

Author: wshobson
Repository: wshobson/agents
Created: 10 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

evaluate-plugin

Evaluate plugin quality. Use when user says "evaluate plugin", "review plugin quality", "score my plugin", "check plugin", "rate plugin".

1 Updated 3 days ago

fabioc-aloha

AI & Automation Listed

evaluate

Comprehensive quality grading. Checks prompt compliance, code quality, security, test coverage, architecture fitness. Produces a percentage score. Not lenient. Keywords: evaluate, grade, check, verify, validate, scorecard, quality, percentage, score, how good

2 Updated 3 days ago

jvalin17

AI & Automation Listed

skill-evaluator

Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.

389 Updated 5 months ago

gotalab