eval-run

Solid

Execute or supervise a planned Mnemon harness eval run in an isolated HostAgent workspace.

AI & Automation · 322 stars · 46 forks · Updated today · Apache-2.0

Quality Score: 88/100

Stars (20%): 84
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 47
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Eval Run

Use this skill to execute or supervise a planned eval run.

## Procedure

1. Confirm the plan names a host, suite or scenario, and evidence targets.
2. Create or use an isolated workspace. Do not run scenario state in the developer's active workspace unless the eval explicitly requires it.
3. Install the requested loop templates with `harness/ops`.
4. For Codex app-server evals, use the project runner when available:

   ```bash
   python3 scripts/codex_app_server_eval.py --suite
   ```

   Use a specific suite option when the scenario requires it.
5. Collect artifacts and logs before cleanup.
6. Record timeouts, setup failures, and HostAgent readiness failures as eval evidence, not as silent skips.

## Boundaries

- Do not change canonical scenarios, suites, or rubrics while running an eval.
- Do not delete artifacts needed for report review.
- Do not treat an exploratory run as a regression result.
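The isolation, artifact-collection, and failure-recording steps of the procedure can be sketched in Python as follows. This is a minimal illustration, not part of the Mnemon harness: the helper names `run_step` and `run_isolated`, the status labels, and the `evidence.json` file name are all hypothetical, and a real run would pass the project runner command in place of arbitrary commands.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_step(cmd, workspace, timeout_s):
    """Run one command in the workspace and return an evidence record.

    Hypothetical helper: timeouts and launch failures become explicit
    records instead of being skipped silently.
    """
    record = {"cmd": cmd, "status": None, "detail": ""}
    try:
        proc = subprocess.run(
            cmd, cwd=workspace, capture_output=True, text=True, timeout=timeout_s
        )
        record["status"] = "passed" if proc.returncode == 0 else "failed"
        record["detail"] = proc.stderr.strip()
    except subprocess.TimeoutExpired:
        record["status"] = "timeout"
        record["detail"] = f"exceeded {timeout_s}s"
    except OSError as exc:
        # For example, the executable is missing from the workspace host.
        record["status"] = "setup_failure"
        record["detail"] = str(exc)
    return record


def run_isolated(cmds, evidence_dir, timeout_s=60):
    """Run commands in a throwaway workspace, keeping artifacts and records."""
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    # Isolated workspace: never the developer's active checkout.
    workspace = Path(tempfile.mkdtemp(prefix="eval-run-"))
    try:
        records = [run_step(cmd, workspace, timeout_s) for cmd in cmds]
        # Collect artifacts and the evidence log before cleanup.
        shutil.copytree(workspace, evidence_dir / "workspace", dirs_exist_ok=True)
        (evidence_dir / "evidence.json").write_text(json.dumps(records, indent=2))
    finally:
        shutil.rmtree(workspace, ignore_errors=True)
    return records
```

Calling `run_isolated([["python3", "some_script.py"]], "evidence/")` would return one record per command and leave a copy of the workspace under `evidence/workspace`. Copying before `shutil.rmtree` mirrors the "collect before cleanup" rule. Commands that change state outside their working directory would still need stronger isolation than a temporary directory provides.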

Details

Author: mnemon-dev
Repository: mnemon-dev/mnemon
Created: 3 months ago
Last Updated: today
Language: Go
License: Apache-2.0
