self-eval

Solid

Honestly evaluate AI work quality using a two-axis scoring system. Use after completing a task, code review, or work session to get an unbiased assessment. Detects score inflation, forces devil's advocate reasoning, and persists scores across sessions.

AI & Automation · 16,782 stars · 2,310 forks · Updated 3 days ago · MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Self-Eval: Honest Work Evaluation

ultrathink

**Tier:** STANDARD
**Category:** Engineering / Quality
**Dependencies:** None (prompt-only, no external tools required)

## Description

Self-eval is a Claude Code skill that produces honest, calibrated work evaluations. It replaces the default AI tendency to rate everything 4/5 with a structured two-axis scoring system, mandatory devil's advocate reasoning, and cross-session anti-inflation detection.

The core insight: AI self-assessment converges to "everything is a 4" because a single-axis score conflates task difficulty with execution quality. Self-eval separates these axes, then combines them via a fixed matrix that the model cannot override.

## Features

- **Two-axis scoring:** Independently rates task ambition (Low/Medium/High) and execution quality (Poor/Adequate/Strong), then combines them via a lookup matrix
- **Mandatory devil's advocate:** Before finalizing, the model must argue for both higher AND lower scores, then resolve the tension
- **Score persistence:** Appends scores to `.self-eval-scores.jsonl` in the working directory, building history across sessions
- **Anti-inflation detection:** Reads past scores and flags clustering (4+ of the last 5 identical)
- **Matrix-locked scoring:** The composite score comes from the matrix, not from direct selection. Low ambition caps at 2/5 regardless of execution quality

## Usage

After completing work in a Claude Code session:

```
/self-eval
```

With context about what to evaluate: ...
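As an illustration of the mechanics listed under Features, the matrix lookup, score persistence, and clustering check might look roughly like the Python sketch below. The skill itself is described as prompt-only, so this is not its implementation. Only the file name `.self-eval-scores.jsonl`, the "4+ of last 5 identical" rule, and the cap of 2/5 for Low ambition come from the description. The remaining matrix cells, the record fields (`ambition`, `execution`, `score`), and all function names are hypothetical.

```python
import json
from pathlib import Path

# File name taken from the skill description.
SCORES_FILE = ".self-eval-scores.jsonl"

# ASSUMPTION: only the "Low ambition caps at 2/5" rule is documented.
# Every other cell below is an illustrative guess, not the skill's matrix.
SCORE_MATRIX = {
    ("Low", "Poor"): 1, ("Low", "Adequate"): 2, ("Low", "Strong"): 2,
    ("Medium", "Poor"): 2, ("Medium", "Adequate"): 3, ("Medium", "Strong"): 4,
    ("High", "Poor"): 2, ("High", "Adequate"): 4, ("High", "Strong"): 5,
}


def composite_score(ambition: str, execution: str) -> int:
    """Return the composite from the matrix; it is never chosen directly."""
    return SCORE_MATRIX[(ambition, execution)]


def append_score(directory: str, ambition: str, execution: str) -> int:
    """Append one evaluation as a JSON line and return its composite score."""
    score = composite_score(ambition, execution)
    # Hypothetical record layout; the real field names are not documented here.
    record = {"ambition": ambition, "execution": execution, "score": score}
    with open(Path(directory) / SCORES_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return score


def load_scores(directory: str) -> list[int]:
    """Read all past composite scores, oldest first."""
    path = Path(directory) / SCORES_FILE
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["score"] for line in f if line.strip()]


def is_clustered(scores: list[int], window: int = 5, threshold: int = 4) -> bool:
    """Flag inflation when `threshold` or more of the last `window` scores match."""
    recent = scores[-window:]
    return any(recent.count(s) >= threshold for s in set(recent))
```

A caller would record a score with `append_score(".", "Medium", "Strong")` and then pass `load_scores(".")` to `is_clustered` to decide whether to warn about clustering. Keeping the matrix as a plain dictionary makes the Low-ambition cap a property of the data, not of any judgment made at scoring time.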

Details

Author: alirezarezvani
Repository: alirezarezvani/claude-skills
Created: 7 months ago
Last Updated: 3 days ago
Language: Python
License: MIT

Integrates with

OpenAI (AI) · Anthropic (AI)

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

evaluate

Comprehensive quality grading. Checks prompt compliance, code quality, security, test coverage, architecture fitness. Produces a percentage score. Not lenient. Keywords: evaluate, grade, check, verify, validate, scorecard, quality, percentage, score, how good

2 Updated today

jvalin17

AI & Automation Listed

agentic-eval

Evaluate and improve AI-generated output with explicit rubrics, reflection loops, and stop conditions. Use when building self-critique workflows, evaluator-optimizer pipelines, or acceptance gates for code, docs, analysis, or plans.

1 Updated today

bg-szy

AI & Automation Solid

eval-skills

Audit all skills in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality. Produces a scored report with effort-level recommendations for each skill. Use when onboarding to a new project, reviewing skill quality before shipping, or adding effort fields to an existing skill library.

4,608 Updated 2 days ago

FlorianBruniaux

AI & Automation Listed

ai-evals

Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.

0 Updated today

TindanLawrence

AI & Automation Listed

ai-reliability-eval

Measures AI system reliability over time by defining pass/fail criteria before implementation, running capability checks, and tracking regression via pass@k metrics. Trigger for 'how reliable is this', 'did my changes break anything', 'measure AI performance', 'define success criteria', 'eval this feature', 'check skill regression'. Not for code correctness; use /ai-test instead. Not for quality gates; use /ai-verify instead — evals measure AI task completion consistency.

49 Updated today

arcasilesgroup