evaluating-code-models

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. An industry-standard harness from the BigCode Project, used by Hugging Face leaderboards.

AI & Automation · 2,279 stars · 168 forks · Updated 3 weeks ago · Apache-2.0

Quality Score: 97/100

Stars (20%): 100
Recency (20%): 90
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# BigCode Evaluation Harness - Code Model Benchmarking

## Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).

**Installation**:

```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```

**Evaluate on HumanEval**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```

**View available tasks**:

```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```

## Common Workflows

### Workflow 1: Standard Code Benchmark Evaluation

Evaluate model on core code benchmarks (HumanEval, MBPP, HumanEval+).

**Checklist**:

```
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
```

**Step 1: Choose benchmark suite**

**Python code generation** (most common):

- **HumanEval**: 164 handwritten problems, function completion
- **HumanEval+**: Same 164 problems with 80× more tests (stricter)
- **MBPP**: 500 crowd-sourced problems, entry-level difficulty
- **MBPP+**: 399 curated problems with 35× more tests

**Multi-language** (18 languages):

- **MultiPL-E**: ...
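
**How pass@k is computed (illustrative sketch)**:

The checklist ends with analyzing pass@k results, so it may help to see what that number measures. The sketch below uses the commonly cited unbiased estimator, `1 - C(n-c, k) / C(n, k)`, where `n` is the number of samples generated per problem (for example `--n_samples 20`) and `c` is how many of them passed the tests. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and not part of the harness API, and it is an assumption that the harness reports exactly this estimator.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Hypothetical helper for illustration; not a harness function.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Made-up per-problem pass counts for three problems, 20 samples each.
    counts = [0, 5, 20]
    print("pass@1 :", mean_pass_at_k(counts, n=20, k=1))
    print("pass@10:", mean_pass_at_k(counts, n=20, k=10))
```

Because `k` cannot exceed `n`, a run with `--n_samples 20` can support pass@1 and pass@10 under this estimator, but not pass@100.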

Details

Author: foryourhealth111-pixel
Repository: foryourhealth111-pixel/Vibe-Skills
Created: 3 months ago
Last Updated: 3 weeks ago
Language: Python
License: Apache-2.0
