ref-hallucination-arena

Solid

Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.
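The metrics named above can be illustrated with a small sketch. This is not OpenJudge's implementation: the `CheckedRef` record, the `summarize` function, and the exact metric definitions (for example, counting an unverified reference as a hallucination) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

FIELDS = ("title", "author", "year", "doi")


@dataclass
class CheckedRef:
    """One cited reference after database lookup (hypothetical record layout)."""
    verified: bool                                            # a matching record was found
    field_ok: Dict[str, bool] = field(default_factory=dict)   # per-field match flags
    year: Optional[int] = None                                # year claimed by the model


def summarize(refs: List[CheckedRef],
              year_min: Optional[int] = None,
              year_max: Optional[int] = None) -> dict:
    """Aggregate hallucination rate, per-field accuracy, and year compliance."""
    total = len(refs)
    if total == 0:
        return {"hallucination_rate": 0.0, "field_accuracy": {}, "year_compliance": 0.0}

    # Assumption: a reference that could not be verified counts as hallucinated.
    unverified = sum(1 for r in refs if not r.verified)

    # Share of all references whose field matched the database record.
    field_accuracy = {
        name: sum(1 for r in refs if r.field_ok.get(name, False)) / total
        for name in FIELDS
    }

    def in_range(r: CheckedRef) -> bool:
        # A missing year is treated as non-compliant.
        if r.year is None:
            return False
        if year_min is not None and r.year < year_min:
            return False
        if year_max is not None and r.year > year_max:
            return False
        return True

    return {
        "hallucination_rate": unverified / total,
        "field_accuracy": field_accuracy,
        "year_compliance": sum(1 for r in refs if in_range(r)) / total,
    }
```

The real pipeline may weight fields differently or compute per-field accuracy over verified references only; the sketch divides by all references to keep the arithmetic easy to follow.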

AI & Automation · 633 stars · 54 forks · Updated 3 days ago · Apache-2.0


Quality Score: 92/100

Weights: Stars 20% · Recency 20% · Frontmatter 20% · Documentation 15% · Issue Health 10% · License 10% · Description 5%

Skill Content

# Reference Hallucination Arena Skill

Evaluate how accurately LLMs recommend real academic references using the OpenJudge `RefArenaPipeline`:

1. **Load queries**: from JSON/JSONL dataset
2. **Collect responses**: BibTeX-formatted references from target models
3. **Extract references**: parse BibTeX entries from model output
4. **Verify references**: cross-check against Crossref / PubMed / arXiv / DBLP
5. **Score & rank**: compute verification rate, per-field accuracy, discipline breakdown
6. **Generate report**: Markdown report + visualization charts

## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for ref_hallucination_arena (chart generation)
pip install matplotlib
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Config YAML path | Yes | Defines endpoints, dataset, verification settings |
| Dataset path | Yes | JSON/JSONL file with queries (can be set in config) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| CrossRef email | No | Improves API rate limits for verification |
| PubMed API key | No | Improves PubMed rate limits |
| Output directory | No | Default: `./evaluation_results/ref_hallucination_arena` |
| Report language | No | `"en"` (default) or `"zh"` |
| Tavily API key | No | Required only if using tool-augmented mode |

## Quick start

### CLI

```bash
# Run evaluation with config file
python -m cookbooks.ref_hallucination_arena --...
```
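As a rough illustration of the extract and verify steps listed above, the sketch below pulls BibTeX entries out of free-form model output and builds a lookup URL for the public Crossref REST API. It is not the `RefArenaPipeline` code: `extract_bibtex_entries` and `crossref_query_url` are hypothetical helper names, and the regex-based parser is an assumption that only handles one level of nested braces inside a field value. The `mailto` parameter is how Crossref identifies polite-pool clients, which is presumably what the optional email setting feeds.

```python
import re
from urllib.parse import urlencode

# "@type{key," marks the start of an entry.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]*)\s*,")
# name = {value} | "value" | 123   (braced values may nest one level deep)
FIELD = re.compile(
    r"(\w+)\s*=\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\}|\"([^\"]*)\"|(\d+))"
)


def extract_bibtex_entries(text):
    """Return a list of {"type", "key", "fields"} dicts found in `text`."""
    entries = []
    for m in ENTRY_START.finditer(text):
        # Walk forward until the brace opened by "@type{" is closed.
        depth, i = 1, m.end()
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            continue  # unbalanced braces: skip the malformed entry
        body = text[m.end(): i - 1]
        fields = {}
        for fm in FIELD.finditer(body):
            raw = next(g for g in fm.groups()[1:] if g is not None)
            # Drop protective braces and collapse whitespace.
            cleaned = " ".join(raw.replace("{", "").replace("}", "").split())
            fields[fm.group(1).lower()] = cleaned
        entries.append(
            {"type": m.group(1).lower(), "key": m.group(2), "fields": fields}
        )
    return entries


def crossref_query_url(title, mailto=None, rows=3):
    """Build a Crossref /works search URL for a bibliographic title query."""
    params = {"query.bibliographic": title, "rows": rows}
    if mailto:
        params["mailto"] = mailto  # opts in to Crossref's polite pool
    return "https://api.crossref.org/works?" + urlencode(params)
```

A caller would fetch the URL, compare the returned title, authors, year, and DOI against the parsed fields, and fall back to PubMed, arXiv, or DBLP when Crossref has no match. The fetch and comparison are left out so the sketch runs without network access.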

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 3 days ago
Language: Python
License: Apache-2.0

Integrates with

OpenAI · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

auto-arena

Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.

633 Updated 3 days ago

agentscope-ai

AI & Automation Solid

bib-verify

Verify a BibTeX file for hallucinated or fabricated references by cross-checking every entry against CrossRef, arXiv, and DBLP. Reports each reference as verified, suspect, or not found, with field-level mismatch details (title, authors, year, DOI). Use when the user wants to check a .bib file for fake citations, validate references in a paper, or audit bibliography entries for accuracy.

633 Updated 3 days ago

agentscope-ai

AI & Automation Listed

hallucination-check

Citation-grounding verifier — for each claim in an LLM response, confirm support in the retrieved context, report ungrounded claims

2 Updated today

bakw00ds

AI & Automation Solid

openjudge

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

633 Updated 3 days ago

agentscope-ai

AI & Automation Solid

llm-judge

Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.

61 Updated today

existential-birds