evaluate

Solid

Evaluates RAG retrieval and LLM-as-judge metrics (faithfulness, relevancy, context precision). Triggers: measure RAG quality, knowledge gap, RAG eval, golden dataset.

AI & Automation · 155 stars · 19 forks · Updated 2 days ago · MIT

Quality Score: 93/100

Stars (20%): 73
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 80
License (10%): 100
Description (5%): 100

Skill Content

# RAG Evaluation

Evaluate RAG quality using LLM-as-a-Judge methodology.

## Usage

```
/evaluate [--threshold 0.7]
```

## Execution

### Direct Execution (recommended for most projects)

```bash
# Run RAG evaluation
python3 scripts/evaluate_rag.py

# With custom thresholds
python3 scripts/evaluate_rag.py \
  --faithfulness 0.7 \
  --relevancy 0.7 \
  --context 0.6

# Detect knowledge gaps
python3 scripts/knowledge_gaps.py --detect

# Generate gap report
python3 scripts/knowledge_gaps.py --report
```

### Docker Execution (containerized projects)

```bash
# Replace {api-container} with your API server container name
docker exec {api-container} python3 scripts/evaluate_rag.py

# With custom thresholds
docker exec {api-container} python3 scripts/evaluate_rag.py \
  --faithfulness 0.7 \
  --relevancy 0.7 \
  --context 0.6

# Detect knowledge gaps
docker exec {api-container} python3 scripts/knowledge_gaps.py --detect

# Generate gap report
docker exec {api-container} python3 scripts/knowledge_gaps.py --report
```

## Metrics

| Metric | Description | Target |
|--------|-------------|--------|
| **Faithfulness** | Is answer based on context? | >70% |
| **Relevancy** | Does answer address question? | >70% |
| **Context Precision** | Is found context accurate? | >60% |

## Evaluation Process

1. **Generate test queries** from golden dataset
2. **Execute RAG pipeline** for each query
3. **LLM judges** each response on metrics
4. **Report** aggregate scores

## Golden Dataset

Located ...
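
The four steps under Evaluation Process, combined with the targets in the Metrics table, can be sketched as a small loop. This is a minimal illustration and not the contents of `scripts/evaluate_rag.py`: the names `run_evaluation`, `stub_pipeline`, and `stub_judge` are hypothetical, the judge is a stub with hard-coded scores where a real run would call an LLM, and treating a score equal to the threshold as passing is an assumption.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Default thresholds mirror the CLI example (0.7 / 0.7 / 0.6).
THRESHOLDS: Dict[str, float] = {
    "faithfulness": 0.7,
    "relevancy": 0.7,
    "context_precision": 0.6,
}

Scores = Dict[str, float]
Pipeline = Callable[[str], Tuple[str, List[str]]]   # query -> (answer, contexts)
Judge = Callable[[str, str, List[str]], Scores]     # -> score per metric in [0, 1]


def run_evaluation(
    queries: List[str],
    rag_pipeline: Pipeline,
    judge: Judge,
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Tuple[Scores, Dict[str, bool]]:
    """Run every query, judge each response, and aggregate by metric."""
    if not queries:
        raise ValueError("need at least one query from the golden dataset")

    per_query: List[Scores] = []
    for query in queries:
        answer, contexts = rag_pipeline(query)            # step 2
        per_query.append(judge(query, answer, contexts))  # step 3

    # Step 4: average each metric and compare against its threshold.
    aggregate = {m: mean(s[m] for s in per_query) for m in thresholds}
    passed = {m: aggregate[m] >= t for m, t in thresholds.items()}
    return aggregate, passed


# Hypothetical stand-ins so the sketch runs without a retriever or an LLM.
def stub_pipeline(query: str) -> Tuple[str, List[str]]:
    return f"answer to {query}", [f"context for {query}"]


FIXED_SCORES: Dict[str, Scores] = {
    "q1": {"faithfulness": 0.9, "relevancy": 0.8, "context_precision": 0.5},
    "q2": {"faithfulness": 0.7, "relevancy": 0.9, "context_precision": 0.4},
}


def stub_judge(query: str, answer: str, contexts: List[str]) -> Scores:
    return FIXED_SCORES[query]


aggregate, passed = run_evaluation(["q1", "q2"], stub_pipeline, stub_judge)
for metric in THRESHOLDS:
    status = "PASS" if passed[metric] else "FAIL"
    print(f"{metric}: {aggregate[metric]:.2f} ({status})")
```

Passing the pipeline and judge in as callables keeps the aggregation logic testable without a model; swapping `stub_judge` for a function that prompts an LLM would not change the loop.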

Details

Author: softspark
Repository: softspark/ai-toolkit
Created: 2 months ago
Last Updated: 2 days ago
Language: Python
License: MIT
