evaluating-code-models

Solid

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

AI & Automation · 5 stars · 0 forks · Updated yesterday · MIT

Quality Score: 83/100

Stars (weight 20%)
Recency (weight 20%): 100
Frontmatter (weight 20%)
Documentation (weight 15%): 100
Issue Health (weight 10%)
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# BigCode Evaluation Harness - Code Model Benchmarking

## Quick Start

BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).

**Installation**:

```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
```

**Evaluate on HumanEval**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```

**View available tasks**:

```bash
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
```

## Common Workflows

### Workflow 1: Standard Code Benchmark Evaluation

Evaluate model on core code benchmarks (HumanEval, MBPP, HumanEval+).

**Checklist**:

```
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
```

**Step 1: Choose benchmark suite**

**Python code generation** (most common):

- **HumanEval**: 164 handwritten problems, function completion
- **HumanEval+**: Same 164 problems with 80× more tests (stricter)
- **MBPP**: 500 crowd-sourced problems, entry-level difficulty
- **MBPP+**: 399 curated problems with 35× more tests

**Multi-language** (18 languages):

- **MultiPL-E**: ...
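
**Reading pass@k (illustrative sketch)**

The checklist's last step is analyzing pass@k results. The harness reports these numbers itself. The sketch below is not its implementation, only a minimal illustration of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where `n` is the number of samples generated per problem (the `--n_samples` value above) and `c` is how many of them passed the tests. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples (out of the n generated) contains at least one
    passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical pass counts for three problems, 20 samples each.
    counts = [0, 20, 5]
    print(f"pass@1  = {mean_pass_at_k(counts, n=20, k=1):.4f}")
    print(f"pass@10 = {mean_pass_at_k(counts, n=20, k=10):.4f}")
```

With `k = 1` the estimate reduces to the fraction of passing samples, `c / n`. Note that `k` cannot exceed the number of samples generated per problem, so 20 samples support pass@1 and pass@10 but not pass@100.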

Details

Author: immacualate
Repository: immacualate/claude-forge
Created: 1 year ago
Last Updated: yesterday
Language: Shell
License: MIT

Integrates with

Hugging Face · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

evaluating-llms-harness

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

5 Updated yesterday

immacualate

AI & Automation Listed

benchmark-opencode-models

Deep-benchmark which OpenCode models are actually viable for opencode-bridge — ping each candidate model, then run 5 canned superpowers-style task prompts per model (feature vs bugfix, short vs detailed, plus a dedicated TDD red-to-green prompt), independently verify every result by executing the generated code (never trust OpenCode's self-reported "done"), and score each run on time/quality/completeness/autonomy/discipline/red-green-accuracy/test-call-discipline. For a fast pass/fail availability check with no scoring, use check-opencode-models instead.

0 Updated 5 days ago

darkstar1227

AI & Automation Listed

eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

3 Updated today

uzysjung