rl-reward

Solid

Build RL reward signals using the OpenJudge framework. Covers choosing between pointwise and pairwise reward strategies based on RL algorithm, task type, and cost; aggregating multi-dimensional pointwise scores into a scalar reward; pairwise tournament reward for GRPO on subjective tasks (net win rate across group rollouts); generating preference pairs for DPO/RLAIF; and normalizing scores for training stability. Use when building reward models, scoring rollouts for GRPO/REINFORCE, generating preference data for DPO, or doing Best-of-N selection.

AI & Automation 633 stars 54 forks Updated 3 days ago Apache-2.0

Quality Score: 92/100

Score components (weights): Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%

Skill Content

# RL Reward Construction with OpenJudge

Build reward signals for reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) using the `openjudge` library.

## When to Use This Skill

- Building scalar rewards for GRPO / REINFORCE rollout scoring
- Generating (chosen, rejected) preference pairs for DPO / IPO
- Best-of-N candidate selection
- Multi-dimensional reward shaping (correctness + safety + format)
- Replacing or bootstrapping a reward model with LLM-as-judge

## Step 1: Choose Your Reward Strategy

Use this decision tree **before** writing any code:

```
RL Algorithm + Task type?
│
├── GRPO / REINFORCE: Verifiable task (math, code, structured output)
│   └── → POINTWISE ✅  (FunctionGrader, exact score, zero LLM cost)
│
├── GRPO / REINFORCE: Subjective task (instruction following, dialogue, summarization)
│   └── → PAIRWISE TOURNAMENT ✅  (compare each rollout vs all others in group,
│                                   reward = net win rate within group)
│
├── DPO / IPO / SLiC: need (chosen, rejected) pairs
│   └── → PAIRWISE ✅  (two-way comparison, return winner/loser)
│
└── Best-of-N / reranking: rank N candidates
    └── → LISTWISE ✅  (single call ranks all N at once)
```

```
Cost constraint?
├── Low budget
│   └── FunctionGrader (free) → pointwise; or pairwise with small judge model
│
├── Medium budget
│   └── Pointwise: 2–3 LLM graders + WeightedSumAggregator
│   └── Pairwise tournament: 1 LLM judge, N*(N-1)/2 c...
```
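The pairwise tournament strategy named in the decision tree (reward = net win rate within the group, costing N*(N-1)/2 comparisons) can be illustrated with a small library-independent sketch. This is not the OpenJudge API: `net_win_rate_rewards`, the `judge` callable, and the toy `longer_wins` judge are all hypothetical names introduced here, and the sketch assumes one comparison per unordered pair with ties scoring zero. In practice the judge would be an LLM call.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def net_win_rate_rewards(
    rollouts: Sequence[str],
    judge: Callable[[str, str], int],
) -> List[float]:
    """Score each rollout by (wins - losses) / (N - 1) within its group.

    `judge(a, b)` is a hypothetical callable (an assumption, not an
    OpenJudge function): it returns 1 if `a` wins, -1 if `b` wins,
    and 0 for a tie.
    """
    n = len(rollouts)
    if n < 2:
        # A single rollout has no opponents, so it gets a neutral reward.
        return [0.0] * n
    net = [0] * n
    # Each unordered pair is judged once: N*(N-1)/2 comparisons in total.
    for i, j in combinations(range(n), 2):
        outcome = judge(rollouts[i], rollouts[j])
        net[i] += outcome
        net[j] -= outcome
    # Dividing by the number of opponents keeps rewards in [-1, 1].
    return [score / (n - 1) for score in net]


def longer_wins(a: str, b: str) -> int:
    """Toy stand-in judge (assumption): prefers the longer response."""
    return (len(a) > len(b)) - (len(a) < len(b))


group = ["a", "abc", "ab", "abcd"]
rewards = net_win_rate_rewards(group, longer_wins)
print(rewards)
```

Because every win for one rollout is a loss for another, the rewards in a group sum to zero, so they are already centered within the group. Judging each pair only once halves the cost but leaves any position bias in the judge uncorrected; judging both orderings is a common alternative at twice the cost.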

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 3 days ago
Language: Python
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

openjudge

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

633 Updated 3 days ago

agentscope-ai

AI & Automation Solid

grpo-rl-training

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

9,182 Updated 1 month ago

Orchestra-Research

AI & Automation Featured

grpo-rl-training

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

27,705 Updated today

davila7

AI & Automation Solid

grpo-rl-training

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

175,435 Updated today

NousResearch

AI & Automation Featured

fine-tuning-with-trl

Fine-tune LLMs using reinforcement learning with TRL: SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with preferences, or want to train from human feedback. Works with HuggingFace Transformers.

27,705 Updated today

davila7