evaluate

Featured

Evaluates RAG retrieval and LLM-as-judge metrics (faithfulness, relevancy, context precision). Triggers: measure RAG quality, knowledge gap, RAG eval, golden dataset.

AI & Automation · 161 stars · 21 forks · Updated yesterday · Apache-2.0

Quality Score: 93/100

Stars 20% · Recency 20% (100) · Frontmatter 20% · Documentation 15% (100) · Issue Health 10% · License 10% (100) · Description 5% (100)

Skill Content

# RAG Evaluation

Evaluate RAG quality using LLM-as-a-Judge methodology.

## Usage

```
/evaluate [--threshold 0.7]
```

## Execution

### Direct Execution (recommended for most projects)

```bash
# Run RAG evaluation
python3 scripts/evaluate_rag.py

# With custom thresholds
python3 scripts/evaluate_rag.py \
  --faithfulness 0.7 \
  --relevancy 0.7 \
  --context 0.6

# Detect knowledge gaps
python3 scripts/knowledge_gaps.py --detect

# Generate gap report
python3 scripts/knowledge_gaps.py --report
```

### Docker Execution (containerized projects)

```bash
# Replace {api-container} with your API server container name
docker exec {api-container} python3 scripts/evaluate_rag.py

# With custom thresholds
docker exec {api-container} python3 scripts/evaluate_rag.py \
  --faithfulness 0.7 \
  --relevancy 0.7 \
  --context 0.6

# Detect knowledge gaps
docker exec {api-container} python3 scripts/knowledge_gaps.py --detect

# Generate gap report
docker exec {api-container} python3 scripts/knowledge_gaps.py --report
```

## Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| **Faithfulness** | Is answer based on context? | >70% |
| **Relevancy** | Does answer address question? | >70% |
| **Context Precision** | Is found context accurate? | >60% |

## Evaluation Process

1. **Generate test queries** from golden dataset
2. **Execute RAG pipeline** for each query
3. **LLM judges** each response on metrics
4. **Report** aggregate scores

## Golden Dataset

Located ...
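The four steps under Evaluation Process, combined with the targets in the Metrics table, might be sketched roughly as below. This is an illustrative outline only, not the contents of `scripts/evaluate_rag.py`: the names `Judgement`, `evaluate`, `pipeline`, and `judge` are hypothetical, the judge is assumed to return scores in the range 0 to 1, and a metric is assumed to pass only when its mean is strictly above the target (matching the `>` in the table; the real script may compare differently).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Targets copied from the Metrics table; the dict layout itself is an assumption.
THRESHOLDS: Dict[str, float] = {
    "faithfulness": 0.7,
    "relevancy": 0.7,
    "context_precision": 0.6,
}


@dataclass
class Judgement:
    """Per-query scores from the LLM judge (hypothetical shape, each in 0..1)."""
    faithfulness: float
    relevancy: float
    context_precision: float


def evaluate(
    queries: Sequence[str],
    pipeline: Callable[[str], Tuple[str, List[str]]],
    judge: Callable[[str, str, List[str]], Judgement],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Tuple[Dict[str, float], List[str]]:
    """Run every query, judge each response, and aggregate by mean."""
    if not queries:
        raise ValueError("at least one test query is required")

    judgements = []
    for query in queries:
        # Step 2: run the RAG pipeline to get an answer and its retrieved context.
        answer, contexts = pipeline(query)
        # Step 3: let the LLM judge score this single response.
        judgements.append(judge(query, answer, contexts))

    # Step 4: aggregate each metric as the mean over all queries.
    scores = {m: mean(getattr(j, m) for j in judgements) for m in thresholds}
    # Assumed pass rule: the mean must be strictly above the target.
    failed = [m for m, s in scores.items() if not s > thresholds[m]]
    return scores, failed
```

Passing `pipeline` and `judge` in as callables keeps the aggregation logic testable without a live model: a stub judge that returns fixed scores is enough to check the threshold handling.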

Details

Author: softspark
Repository: softspark/ai-toolkit
Created: 4 months ago
Last Updated: yesterday
Language: Python
License: Apache-2.0

Integrates with

Anthropic (AI) · Docker (Infrastructure)

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

monitor-rag-quality

Use this to measure and monitor the quality of a RAG (retrieval-augmented generation) pipeline - whether it retrieves the right context and answers faithfully. Trigger on "my RAG gives wrong answers", "is my retrieval any good", "the chatbot makes things up", "evaluate my RAG", "improve RAG accuracy". Diagnose whether the failure is in retrieval or generation - they need different fixes.

26 Updated yesterday

ContextJet-ai

AI & Automation Listed

rag-evaluation

Measure retrieval and generation separately against a judged set, so you know whether a wrong answer came from the search or the model. Use when a RAG system is unreliable and every fix is a guess.

4 Updated today

Amey-Thakur

AI & Automation Listed

add-llm-evals

Use this when adding evaluation to an LLM/agent app - measuring output quality (correctness, faithfulness, relevance, safety) rather than just watching traces. Trigger on "add evals", "test my prompt", "is my RAG accurate", "catch regressions", "score outputs", or setting up an eval suite in CI. Covers offline (CI) and online (production LLM-as-a-judge) evaluation.

26 Updated yesterday

ContextJet-ai