eval-improve

Solid

Turn stable Mnemon harness eval findings into scoped project, loop, adapter, docs, or eval asset improvements.

AI & Automation 322 stars 46 forks Updated today Apache-2.0


Quality Score: 88/100


Skill Content

# Eval Improve

Use this skill to turn stable eval findings into project changes.

## Procedure

1. Confirm the finding is backed by a report or repeated observation.
2. Pick one improvement target. Avoid mixing loop policy changes, runner changes, docs changes, and scenario promotion in one patch unless they are tightly coupled.
3. For eval asset changes:
   - keep exploratory ideas in scratch
   - add candidate assets under runtime candidates
   - promote canonical repo assets only after curation
4. For code or harness changes, run the narrowest relevant eval or validation.
5. Summarize what changed, which evidence motivated it, and what remains unproven.

## Promotion Checklist

Before making an eval asset canonical, verify:

- It has a clear target and hypothesis.
- It has an explicit rubric.
- It produces reviewable artifacts.
- It is not duplicative.
- It is stable enough for its intended suite.
- It does not reward weak or unsafe behavior.
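
The promotion checklist can be read as a gate: an asset moves from candidate to canonical only when every item holds. The sketch below is one way to express that gate, not Mnemon's actual implementation. The stage names (`"scratch"`, `"candidate"`), the `PromotionChecklist` class, and the `unmet_items` and `can_promote` helpers are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class PromotionChecklist:
    """Hypothetical record of the six checklist items; all default to unmet."""
    clear_target_and_hypothesis: bool = False
    explicit_rubric: bool = False
    reviewable_artifacts: bool = False
    not_duplicative: bool = False
    stable_for_suite: bool = False
    no_reward_for_weak_or_unsafe: bool = False


def unmet_items(checklist: PromotionChecklist) -> list[str]:
    """Return the names of checklist items that are still False, in field order."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def can_promote(stage: str, checklist: PromotionChecklist) -> bool:
    """Allow promotion only from the (assumed) 'candidate' stage with no unmet items."""
    return stage == "candidate" and not unmet_items(checklist)


if __name__ == "__main__":
    checklist = PromotionChecklist(
        clear_target_and_hypothesis=True,
        explicit_rubric=True,
        reviewable_artifacts=True,
        not_duplicative=True,
        stable_for_suite=True,
    )
    # The last item is still unmet, so the asset stays a candidate.
    print(unmet_items(checklist))
    print(can_promote("candidate", checklist))
```

Returning the list of unmet items, rather than a bare boolean, gives the step 5 summary something concrete to report under "what remains unproven".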

Details

Author: mnemon-dev
Repository: mnemon-dev/mnemon
Created: 3 months ago
Last Updated: today
Language: Go
License: Apache-2.0

Integrates with

SQLite · Database

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

eval-analyze

Analyze Mnemon harness eval reports, classify outcomes, and extract improvement evidence.

322 Updated today

mnemon-dev

AI & Automation Solid

eval-plan

Design a scenario-driven Mnemon harness eval with target, hypothesis, HostAgent, loop configuration, evidence, and rubric.

322 Updated today

mnemon-dev

AI & Automation Listed

eval-triage-and-improvement

Use this skill when AI agent evaluations have come back and the user needs to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Works against any agent platform - Copilot Studio is the primary worked example here, but the triage framework applies equally to custom harnesses, LangChain/LangGraph, AutoGen, Semantic Kernel, OpenAI Assistants, and other agent runtimes. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.

1 Updated today

varunk130

AI & Automation Solid

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

199,470 Updated yesterday

affaan-m

AI & Automation Listed

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

0 Updated yesterday

uzysjung