ref-hallucination-arena
SolidBenchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.
Install
Quality Score: 92/100
Skill Content
Details
- Author
- agentscope-ai
- Repository
- agentscope-ai/OpenJudge
- Created
- 10 months ago
- Last Updated
- 3 days ago
- Language
- Python
- License
- Apache-2.0
Integrates with
Similar Skills
Semantically similar based on skill content — not just same category
auto-arena
Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
bib-verify
Verify a BibTeX file for hallucinated or fabricated references by cross-checking every entry against CrossRef, arXiv, and DBLP. Reports each reference as verified, suspect, or not found, with field-level mismatch details (title, authors, year, DOI). Use when the user wants to check a .bib file for fake citations, validate references in a paper, or audit bibliography entries for accuracy.
hallucination-check
Citation-grounding verifier — for each claim in an LLM response, confirm support in the retrieved context, report ungrounded claims
openjudge
Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.
llm-judge
Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations", "evaluate both solutions", "rank these codebases", or "judge which approach wins". Also covers choosing between competing PRs or vendor submissions solving the same problem. Does NOT review a single codebase for quality — use code review skills instead. Does NOT evaluate strategy docs — use strategy-review. Requires a spec file and 2+ repo paths.