openjudge

Solid

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

AI & Automation · 633 stars · 54 forks · Updated 3 days ago · Apache-2.0

Quality Score: 92/100

Score weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%

Skill Content

# OpenJudge Skill

Build evaluation pipelines for LLM applications using the `openjudge` library.

## When to Use This Skill

- User wants to evaluate LLM output quality (correctness, relevance, hallucination, etc.)
- User wants to compare two or more models and rank them
- User wants to design a scoring rubric and automate evaluation
- User wants to analyze evaluation results statistically
- User wants to build a reward model or quality filter

## Sub-documents — Read When Relevant

| Topic | File | Read when… |
|-------|------|------------|
| Grader selection & configuration | `graders.md` | User needs to pick or configure an evaluator |
| Batch evaluation pipeline | `pipeline.md` | User needs to run evaluation over a dataset |
| Auto-generate graders from data | `generator.md` | No rubric yet; generate from labeled examples |
| Analyze & compare results | `analyzer.md` | User wants win rates, statistics, or metrics |

Read the relevant sub-document **before** writing any code.

## Install

```bash
pip install py-openjudge
```

## Architecture Overview

```
Dataset (List[dict])
    │
    ▼
GradingRunner                 ← orchestrates everything
    │
    ├─► Grader A ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
    ├─► Grader B ──► EvaluationStrategy ──► _aevaluate() ──► GraderScore / GraderRank
    └─► Grader C ...
    │
    ├─► Aggregator (optional)  ← combine multiple grader scores into one
    │
    └─► RunnerResult           ← {grader_nam...
```
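The data flow in the architecture overview (dataset, graders wrapped in an evaluation strategy, optional aggregator) can be illustrated with a small library-free sketch. It deliberately does not use the `openjudge` API: every name below (`run_graders`, `average_strategy`, `weighted_aggregate`, the sample fields, and the result shape) is a hypothetical stand-in chosen for illustration, so consult `graders.md` and `pipeline.md` for the real classes and signatures.

```python
import asyncio
from statistics import mean
from typing import Awaitable, Callable, Dict, List

# Hypothetical stand-in type: an async callable that scores one sample.
# This is NOT the openjudge grader interface.
Grader = Callable[[dict], Awaitable[float]]


async def non_empty_grader(sample: dict) -> float:
    """Function-based grader: 1.0 if the response is non-empty, else 0.0."""
    return 1.0 if sample["response"].strip() else 0.0


async def exact_match_grader(sample: dict) -> float:
    """Function-based grader: 1.0 if the response equals the reference."""
    return 1.0 if sample["response"].strip() == sample["reference"].strip() else 0.0


def average_strategy(grader: Grader, repeats: int) -> Grader:
    """Wrap a grader so it runs `repeats` times and the scores are averaged."""
    async def wrapped(sample: dict) -> float:
        scores = await asyncio.gather(*(grader(sample) for _ in range(repeats)))
        return mean(scores)
    return wrapped


async def run_graders(dataset: List[dict], graders: Dict[str, Grader]) -> Dict[str, List[float]]:
    """Send every sample to every grader; return one score list per grader."""
    results: Dict[str, List[float]] = {}
    for name, grader in graders.items():
        results[name] = list(await asyncio.gather(*(grader(s) for s in dataset)))
    return results


def weighted_aggregate(results: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Combine per-grader scores into one weighted-mean score per sample."""
    total = sum(weights.values())
    n_samples = len(next(iter(results.values())))
    return [
        sum(results[name][i] * w for name, w in weights.items()) / total
        for i in range(n_samples)
    ]


dataset = [
    {"query": "2+2?", "response": "4", "reference": "4"},
    {"query": "Capital of France?", "response": "Lyon", "reference": "Paris"},
]
graders = {
    "non_empty": average_strategy(non_empty_grader, repeats=3),
    "exact_match": exact_match_grader,
}

results = asyncio.run(run_graders(dataset, graders))
combined = weighted_aggregate(results, {"non_empty": 1.0, "exact_match": 3.0})
print(results)   # per-grader scores, one entry per sample
print(combined)  # one aggregated score per sample
```

Averaging repeated calls only matters for non-deterministic graders such as LLM judges; the deterministic graders here keep the sketch reproducible while showing where a strategy wrapper sits in the flow.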

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 3 days ago
Language: Python
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

evaluating-llms

Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.

368 Updated 5 months ago

ancoleman

AI & Automation Solid

auto-arena

Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.

633 Updated 3 days ago

agentscope-ai

AI & Automation Featured

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

39,350 Updated today

sickn33

AI & Automation Listed

advanced-evaluation

0 Updated today

mytricker0

AI & Automation Listed

advanced-evaluation

3 Updated today

Kalyanikhandare29