nemo-evaluator-sdk

Solid

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use it when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms. It is NVIDIA's enterprise-grade evaluation platform, built on a container-first architecture for reproducible benchmarking.

DevOps & Infrastructure · 9,182 stars · 697 forks · Updated 1 month ago · MIT

Quality Score: 94/100

Stars (20%): 100
Recency (20%)
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# NeMo Evaluator SDK - Enterprise LLM Benchmarking

## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:

```bash
pip install nemo-evaluator-launcher
```

**Set API key and run evaluation**:

```bash
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

**View available tasks**:

```bash
nemo-evaluator-launcher ls tasks
```

## Common Workflows

### Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

**Checklist**:

```
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
```

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://int...
```
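Before launching a run like the Quick Start one, it can help to confirm that the config file exists and that the environment variable it names under `api_key_name` is actually set. The sketch below is a hypothetical POSIX shell helper, not part of `nemo-evaluator-launcher`: the function names `api_key_name_from_config` and `preflight` are invented for illustration, and it assumes the config contains a plain `api_key_name: NAME` line as in the Quick Start example.

```shell
# Hypothetical pre-flight helpers; these are NOT provided by nemo-evaluator-launcher.

# Print the variable name stored under api_key_name in the given config file.
# Assumption: the file has a simple "api_key_name: NAME" line; only the first is used.
api_key_name_from_config() {
  sed -n 's/^[[:space:]]*api_key_name:[[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\).*$/\1/p' "$1" | head -n 1
}

# Succeed only if the config exists and the variable it names is set and non-empty.
preflight() {
  config="$1"
  if [ ! -f "$config" ]; then
    echo "missing config: $config" >&2
    return 1
  fi
  key_name="$(api_key_name_from_config "$config")"
  if [ -z "$key_name" ]; then
    echo "no api_key_name found in $config" >&2
    return 1
  fi
  # key_name matched an identifier pattern above, so this eval only reads a variable.
  eval "key_value=\${$key_name:-}"
  if [ -z "$key_value" ]; then
    echo "environment variable $key_name is not set" >&2
    return 1
  fi
  echo "ok: $config uses $key_name"
}
```

One way to use it is to gate the launch on the check, for example `preflight config.yaml && nemo-evaluator-launcher run --config-dir . --config-name config`, so a missing key fails immediately with a readable message.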

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT

Integrates with

OpenAI (AI) · Hugging Face (AI) · Docker (Infrastructure)
