evaluate
Evaluates RAG retrieval and LLM-as-judge metrics (faithfulness, relevancy, context precision). Triggers: measure RAG quality, knowledge gap, RAG eval, golden dataset.
Quality Score: 93/100
Details
- Author: softspark
- Repository: softspark/ai-toolkit
- Created: 2 months ago
- Last Updated: 2 days ago
- Language: Python
- License: MIT
Similar Skills
Semantically similar based on skill content, not just the same category
rag-eval-guardrails
Build a verified eval harness for a RAG/LLM feature plus PII/PHI-leakage guardrails, gated by checks that actually run. Scores a precomputed predictions file (so it runs with ZERO API access) on groundedness, citation validity, retrieval hit@k, answer F1/exact-match, refusal rate, and latency; compares to config thresholds and a baseline to catch regressions; and fails the build on PII/PHI leakage. Use when the user wants to evaluate or regression-test an AI/RAG feature, measure hallucination/groundedness, add an eval gate to CI, or scan prompts/answers/logs for leaked identifiers. Triggers: "RAG evaluation", "LLM eval", "eval harness", "hallucination", "groundedness", "PII/PHI leakage", "guardrails", "regression testing for AI features".
llm-eval
LLM evaluation: build evaluation datasets, choose metrics (RAGAS, G-Eval, LLM-as-judge), run automated evals, monitor production quality, and detect regressions.
evaluating-llms
Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.