eval · listed

Use this skill to measure whether memory grounding actually improves PR quality. It shells out to `kagura-engineer eval <issues...>`, runs the same fixed issue set in two arms (recall ON vs. OFF), and prints an A/B table. This is a harness with high cost: it runs the full loop twice per issue and mutates the repo. Confirm with the user before launching.
kagura-ai/kagura-engineer · ★ 0 · AI & Automation · score 70

Install: claude install-skill kagura-ai/kagura-engineer
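
The flow the description outlines (refuse to run without an issue set, confirm with the user, then shell out to `kagura-engineer eval <issues...>`) could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names, the issue identifiers `12` and `34`, the placement of `--review` after the issue list, and the injectable `runner` are all assumptions made for the example.

```python
import subprocess
from typing import Callable, Optional, Sequence


def build_eval_command(issues: Sequence[str], review: bool = False) -> list[str]:
    """Build the argv for `kagura-engineer eval <issues...>`."""
    if not issues:
        # Mirrors the skill's rule: with no issue set, do not guess or shell out.
        raise ValueError("no issue set given; print usage and stop")
    cmd = ["kagura-engineer", "eval", *issues]
    if review:
        # Assumption: appending the flag after the issues is illustrative only.
        cmd.append("--review")
    return cmd


def full_loop_runs(issues: Sequence[str]) -> int:
    """Two arms (grounded and control) per issue, so 2N full run loops."""
    return 2 * len(issues)


def launch_eval(
    issues: Sequence[str],
    confirmed: bool,
    review: bool = False,
    runner: Callable = subprocess.run,  # hypothetical injection point for testing
) -> Optional[object]:
    """Shell out only after the user has explicitly confirmed the harness run."""
    cmd = build_eval_command(issues, review)
    if not confirmed:
        # High cost and repo mutation: never launch without confirmation.
        return None
    return runner(cmd, check=False)


if __name__ == "__main__":
    issues = ["12", "34"]  # hypothetical issue identifiers
    print("argv:", build_eval_command(issues, review=True))
    print("full run loops:", full_loop_runs(issues))
```

Keeping command construction separate from the launch step means the cost (2N loops) and the exact argv can be shown to the user before anything touches the repo.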

# kagura-engineer: eval

Thin wrapper around the `kagura-engineer eval` CLI verb (moat lever M3). It drives the **same** fixed issue set through two arms: **grounded** (the normal `run` loop with recall + pinned + graph-expanded memory) and **control** (the identical loop with grounding disabled). It then prints an A/B table on objective signals already in the pipeline: PR-reached rate, gate-verdict rate, and (with `--review`) review findings + re-fix-loop iterations. It reimplements nothing. The orchestration lives in the CLI; this skill discovers config, gates on `doctor`, shells out, and surfaces the result.

> ⚠️ **This is a Harness.** Before launching, confirm the user understands:
> - **Repo mutation × 2N**: the full run loop runs twice per issue (a worktree, commits,
>   and a PR per arm). With `--review` the auto-fix loop also mutates each arm's branch.
> - **Cost**: `run`'s `claude -p` budget, doubled across both arms of every issue.
> - **Disposable issue set**: run it on a pinned, throwaway issue set, not production work.

**Announce:** "Using the kagura-engineer:eval skill — this is a Harness that runs the loop twice per issue."

## No-argument usage

If invoked without an issue set, do NOT guess or shell out. Print this and stop:

```
kagura-engineer:eval <issue> [<issue> ...]

Measure memory-grounded uplift: run the same issues with recall ON vs OFF,
print an A/B table.

⚠ Harness (high cost): the full run loop runs twice per issue; mutates the repo.

Exam