Install: claude install-skill Luis247911/universal-ai-workspace-foundation
# eval-judge
The shared **LLM-as-judge**. Given one output and a rubric, it returns PASS/FAIL with a score.
Offline, it grades deterministically by rubric-keyword overlap (so CI stays green without an API key);
under `UAW_LLM=live` it asks a real model with a reason-then-decide prompt. It is the single
judge that [[eval-loop-builder]] and [[orchestrator-patterns]] both rely on — not a second copy.
## When to use
- Grading a subjective quality that no `exact`/`regex` check can express (tone, completeness, helpfulness).
- The scoring step inside an evaluator-optimizer loop.
- A one-off "is this answer good enough?" check.
## Run it
```
python -m harness.eval judge --rubric "mentions both cost and latency" --output "use caching to cut latency and cost"
python -m harness.eval judge --rubric "is a polite refusal" --output-file reply.txt
python .claude/skills/eval-judge/scripts/run.py judge --rubric "..." --output "..."
```
Prints JSON `{verdict, score, detail, mock}` and exits non-zero on FAIL.
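A caller might consume that output roughly as follows. This is a minimal sketch, not part of the harness: `read_judge_result` is a hypothetical helper, and it assumes the exit code is zero on PASS (the text above only states non-zero on FAIL) and that `verdict` is the literal string `PASS` or `FAIL`.

```python
import json

# Keys named in the printed JSON; their value types are an assumption here.
EXPECTED_KEYS = {"verdict", "score", "detail", "mock"}


def read_judge_result(stdout: str, returncode: int) -> dict:
    """Parse the judge's JSON output and check it agrees with the exit code.

    Hypothetical helper: assumes exit code 0 means PASS and any
    non-zero exit code means FAIL.
    """
    result = json.loads(stdout)
    missing = EXPECTED_KEYS - result.keys()
    if missing:
        raise ValueError(f"judge output is missing keys: {sorted(missing)}")
    passed = result["verdict"] == "PASS"
    if passed != (returncode == 0):
        raise ValueError("verdict and exit code disagree")
    return result
```

In practice `stdout` and `returncode` would come from something like `subprocess.run([...], capture_output=True, text=True)` wrapped around one of the commands above.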
## How it judges
- **Offline (mock)**: PASS if salient rubric keywords appear in the output. Deterministic — same
inputs, same verdict. Good enough to wire the plumbing and keep CI green.
- **Live (`UAW_LLM=live` + `[llm]` extra + key)**: a strict evaluator prompt — reason internally,
then answer exactly PASS or FAIL. The reasoning is discarded; only the verdict is scored.
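The two modes can be sketched as below. This illustrates the idea and is not the harness's actual code: the stopword list, the all-keywords-must-match threshold, and the names `salient_keywords`, `mock_judge`, and `parse_live_verdict` are all assumptions made for the example.

```python
import re

# Hypothetical stopword list; the real judge may choose salient keywords differently.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "both",
             "of", "to", "in", "it", "that", "mentions"}


def salient_keywords(rubric: str) -> list[str]:
    """Lowercased rubric words minus stopwords, de-duplicated, in order."""
    keywords: list[str] = []
    for word in re.findall(r"[a-z0-9]+", rubric.lower()):
        if word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def mock_judge(rubric: str, output: str, threshold: float = 1.0) -> dict:
    """Offline judge: score = fraction of rubric keywords found in the output."""
    keywords = salient_keywords(rubric)
    output_words = set(re.findall(r"[a-z0-9]+", output.lower()))
    hits = [k for k in keywords if k in output_words]
    score = len(hits) / len(keywords) if keywords else 0.0
    verdict = "PASS" if keywords and score >= threshold else "FAIL"
    return {"verdict": verdict, "score": score,
            "detail": f"matched {len(hits)} of {len(keywords)} keywords",
            "mock": True}


def parse_live_verdict(reply: str) -> str:
    """Live mode: keep only the final PASS/FAIL token and drop the reasoning.

    Treating a reply with no verdict as FAIL is an assumption of this sketch.
    """
    tokens = re.findall(r"\b(PASS|FAIL)\b", reply.upper())
    return tokens[-1] if tokens else "FAIL"
```

Because `mock_judge` uses only string operations, the same rubric and output always give the same verdict, which is the property that keeps CI stable without a key.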
## Judge design rules
1. Make the rubric **specific and checkable** ("mentions X and Y"), not vague ("is good").
2