evaluating-code-modelslisted
Install: claude install-skill immacualate/claude-forge
# BigCode Evaluation Harness - Code Model Benchmarking
## Quick Start
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
**Installation**:
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```
**Evaluate on HumanEval**:
```bash
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
```
**View available tasks**:
```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```
## Common Workflows
### Workflow 1: Standard Code Benchmark Evaluation
Evaluate model on core code benchmarks (HumanEval, MBPP, HumanEval+).
**Checklist**:
```
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
```
**Step 1: Choose benchmark suite**
**Python code generation** (most common):
- **HumanEval**: 164 handwritten problems, function completion
- **HumanEval+**: Same 164 problems with 80× more tests (stricter)
- **MBPP**: 500 crowd-sourced problems, entry-level difficulty
- **MBPP+**: 399 curated problems with 35× more tests
**Multi-language** (18 languages):
- **MultiPL-E**: