eval-diagnose

Use when Galileo evidence is available and the user asks why a trace, session, log stream, experiment, metric, or AI app behavior failed, regressed, or became unsafe.
Galileo-Agent-Labs/eval-engineer · ★ 33 · AI & Automation · score 80
Install: claude install-skill Galileo-Agent-Labs/eval-engineer
# Eval Diagnose

Use this skill for evidence-backed RCA once a packet, URL-derived evidence, or trace/session/log-stream context is available.

## Required Reference

Use `skills/eval-engineer/references/rca-recipe.md`, `skills/eval-engineer/references/debug-packets.md`, `skills/eval-engineer/references/evidence-provenance.md`, and `skills/eval-engineer/assets/diagnosis-template.md`.

## Do

- Start from fetched evidence, not source-code guesses.
- Name the failing metric contract and what it proves.
- Label hosted Galileo evidence separately from local deterministic packets before making metric or score claims.
- Inspect traces, spans, sessions, tool calls, retrieval context, and scorer status to classify the fix surface.
- Classify the fix surface: prompt, tool schema, adapter, retriever, ranker, guardrail, metric, dataset, or SDK wiring.
- Write diagnosis and bounded fix plan only when evidence supports it.
- Honor read-only requests. If the user says read-only, dry run, no edits, or "do not edit files", do not write `.galileo/` artifacts. Return the RCA inline and include a short "Would write" list for any suggested artifact paths.

## Gotchas

- Fetched debug packets are the RCA source of truth when scorer jobs are still settling or runner output disagrees with fetched metrics.
- A prompt diff, local score, or code diff is not proof of improvement without before/after Galileo evidence.
- Bare correctness or factuality can be a smoke test only. Prefer the