mkevaluate

Solid

Experimental behavioral evaluator. Drives a running artifact via browser/curl/CLI and records rubric evidence; the runner is not yet a fully implemented automated evaluation system.

AI & Automation 15 stars 2 forks Updated today MIT

Quality Score: 86/100

Score breakdown (weight, with the component score where the page shows one):

Stars (20%)
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# mk:evaluate - Experimental Behavioral Verification

Step-file workflow that drives a running build, probes each rubric criterion via active verification, and produces a graded verdict with runtime evidence. Owned by the `evaluator` agent (Phase 3+).

## When to Use

Activate when:

- User runs `/mk:evaluate <target>` with a URL, file path, or running-app handle
- A generator iteration completes and the harness needs a graded verdict
- After Phase 3 (build) and before Phase 5 (ship) for frontend/fullstack/CLI products
- When asked to "grade the running app", "check the build behaviorally", or "verify against the spec"

Skip when:

- The build has no runnable artifact (pure library, type-only package)
- The task is structural code review only; use `mk:review` instead
- The task is a simple `/mk:fix`; the overhead exceeds the value

## Hard Constraints

1. **Active verification gate**: every verdict MUST include a non-empty `evidence/` directory with at least one of: screenshot, HTTP response capture, CLI stdout+exit-code transcript. `validate-verdict.sh` rejects PASS verdicts with empty evidence and converts them to FAIL.
2. **Skeptic persona enforced**: load `prompts/skeptic-persona.md` at session start. Re-anchor before each criterion grading.
3. **Max 15 criteria per session**: split into multiple sessions if the rubric exceeds this limit. Heuristic: context overflow risk above this threshold.
4. **No source code edits**: evaluator owns `tasks/reviews/*-evalverdict.md` only. Never mo...
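The first and third hard constraints can be illustrated with a minimal Python sketch. This is not the contents of `validate-verdict.sh`, which the listing does not show. The function names `gate_verdict` and `split_into_sessions`, the plain-string verdict values, and the rule that any file under the evidence directory counts as evidence are all assumptions made for illustration.

```python
from pathlib import Path
import tempfile

# Limit taken from the "Max 15 criteria per session" constraint.
MAX_CRITERIA_PER_SESSION = 15


def gate_verdict(verdict: str, evidence_dir: Path) -> str:
    """Hypothetical gate: downgrade PASS to FAIL when no evidence files exist.

    Assumption: any regular file under evidence_dir counts as evidence
    (screenshot, HTTP capture, or CLI transcript).
    """
    has_evidence = evidence_dir.is_dir() and any(
        p.is_file() for p in evidence_dir.rglob("*")
    )
    if verdict == "PASS" and not has_evidence:
        return "FAIL"
    return verdict


def split_into_sessions(criteria, limit=MAX_CRITERIA_PER_SESSION):
    """Hypothetical helper: chunk rubric criteria so no session exceeds the limit."""
    return [criteria[i:i + limit] for i in range(0, len(criteria), limit)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        evidence = Path(tmp) / "evidence"
        evidence.mkdir()
        print(gate_verdict("PASS", evidence))  # empty directory, so FAIL
        (evidence / "cli-transcript.txt").write_text("exit code: 0\n")
        print(gate_verdict("PASS", evidence))  # evidence present, so PASS
    sessions = split_into_sessions([f"criterion-{i}" for i in range(32)])
    print([len(s) for s in sessions])  # [15, 15, 2]
```

The gate only ever downgrades a verdict. A FAIL stays a FAIL whether or not evidence exists, which matches the stated rule that only PASS verdicts with empty evidence are converted.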

Details

Author: ngocsangyem
Repository: ngocsangyem/MeowKit
Created: 4 months ago
Last Updated: today
Language: TypeScript
License: MIT

Bundled in these plugins

MeowKit

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

evaluate

Use for repeatable quality or safety evaluation of stochastic or judgement-bearing systems, prompts, agents, rankings, or artifacts. Not for deterministic tests or ordinary code review; use tdd or code-review.

2 Updated today

mblauberg

AI & Automation Listed

evaluator

Rigorous code and strategy auditor.

10 Updated 1 week ago

samibs

AI & Automation Listed

eval-engine

Iterate-stage skill: turns a spec into the complete runnable verification layer — binary gates, anchored rubric, paste-ready judge prompt, and harness instructions — in one pass. Use when a feature needs its full eval built — 'create an eval for this spec', 'build the verification layer', 'spec to eval harness', 'turn this PRD into something we can run outputs through' — or when /pm routes such a request here. Do NOT use for the gates+rubric design artifact alone (prd-to-eval), for judge prompts over existing criteria (llm-as-judge-designer), for executing an eval over outputs, or for eval definitions.

1 Updated 6 days ago

Abhillashjadhav