eval-analyze

Solid

Analyze Mnemon harness eval reports, classify outcomes, and extract improvement evidence.

AI & Automation · 322 stars · 46 forks · Updated today · Apache-2.0

Quality Score: 88/100

Stars 20% · Recency 20% (100) · Frontmatter 20% · Documentation 15% · Issue Health 10% · License 10% (100) · Description 5% (100)

Skill Content

# Eval Analyze

Use this skill after an eval run to judge behavior and extract improvement evidence.

## Procedure

1. Read the report, relevant artifact summaries, and the selected rubric.
2. Compare observed behavior to the hypothesis.
3. Classify the outcome:
   - `pass`: behavior meets the rubric.
   - `weak`: partially useful but missing expected evidence or consistency.
   - `fail`: behavior contradicts the target expectation.
   - `invalid`: setup or scenario issue prevents judgement.
4. Identify the likely improvement target:
   - memory
   - skill
   - eval
   - host adapter
   - setup
   - docs
   - scenario or rubric
5. If a new eval asset is warranted, create a candidate summary instead of editing canonical assets immediately.

## Output

Write a concise analysis with:

- outcome
- evidence
- likely cause
- recommended next action
- candidate eval asset path, if any
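
The outcome labels and output fields in the skill could be captured in a small record so that an analysis is checked before it is written out. The following is only a sketch under assumptions: `Analysis`, `OUTCOMES`, `TARGETS`, and `render` are hypothetical names chosen for illustration and are not part of Mnemon, and the sample values in the usage block are made up.

```python
from dataclasses import dataclass
from typing import Optional

# Outcome labels and improvement targets copied from the skill text.
OUTCOMES = ("pass", "weak", "fail", "invalid")
TARGETS = (
    "memory", "skill", "eval", "host adapter",
    "setup", "docs", "scenario or rubric",
)


@dataclass
class Analysis:
    """Hypothetical record holding the fields the skill asks for."""
    outcome: str
    evidence: str
    likely_cause: str
    target: str
    next_action: str
    candidate_asset_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject labels that the skill does not define.
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")
        if self.target not in TARGETS:
            raise ValueError(f"unknown improvement target: {self.target!r}")


def render(a: Analysis) -> str:
    """Format the analysis as the concise field list named under Output."""
    lines = [
        f"outcome: {a.outcome}",
        f"evidence: {a.evidence}",
        f"likely cause: {a.likely_cause} ({a.target})",
        f"recommended next action: {a.next_action}",
    ]
    # The asset path is optional, so it is only printed when present.
    if a.candidate_asset_path:
        lines.append(f"candidate eval asset path: {a.candidate_asset_path}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only; they do not come from a real eval run.
    example = Analysis(
        outcome="weak",
        evidence="expected evidence was present in some turns but not all",
        likely_cause="inconsistent behavior across turns",
        target="memory",
        next_action="re-run the scenario before changing any asset",
    )
    print(render(example))
```

Validating the labels at construction time keeps a typo such as `passed` from slipping into a written analysis, and leaving the asset path optional mirrors the "if any" wording in the output list.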

Details

Author: mnemon-dev
Repository: mnemon-dev/mnemon
Created: 3 months ago
Last Updated: today
Language: Go
License: Apache-2.0

Integrates with

SQLite · Database

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

eval-analysis

Analyze eval results to understand signal quality and guide prompt/config changes. Use when the user says "eval-analysis", "analyze eval results", "what do the evals show", "eval regression", or asks about eval signal enrichment.

66 Updated today

eforge-build

AI & Automation Solid

eval-improve

Turn stable Mnemon harness eval findings into scoped project, loop, adapter, docs, or eval asset improvements.

322 Updated today

mnemon-dev

AI & Automation Solid

eval-plan

Design a scenario-driven Mnemon harness eval with target, hypothesis, HostAgent, loop configuration, evidence, and rubric.

322 Updated today

mnemon-dev

AI & Automation Listed

eval-result-interpreter

Analyzes AI agent evaluation results - primarily from Copilot Studio (the worked example here, via its CSV export) but also from custom harnesses or any evaluator that produces per-case pass/fail rows - using Microsoft's Triage & Improvement Playbook. Returns a SHIP / ITERATE / BLOCK verdict with root cause classification, diagnostic triage, prioritized remediation, and pattern analysis.

1 Updated today

varunk130

Data & Documents Listed

evaluate

This skill should be used when the user asks to "analyze results", "improve strategy", "run PDCA", "evaluate effectiveness", "check response rates", or wants to evaluate sales performance and improve strategy. Automatically analyzes and improves strategy, targeting, and messaging based on response rate data.

1 Updated 1 week ago

aitit-inc