nasde-benchmark-creator

Solid

Create coding agent benchmarks for evaluation with nasde. Use this skill when the user wants to:

- Create a new benchmark project (set of tasks for evaluating coding agents)
- Add tasks to an existing benchmark
- Create or modify agent variants (configurations that control agent behavior)
- Set up assessment dimensions and scoring criteria
- Verify that a new benchmark's Docker environment and tests work

Even if the user doesn't say "benchmark", this skill applies if they're talking about creating coding challenges for AI agents or setting up evaluation criteria.

AI & Automation · 11 stars · 0 forks · Updated today · MIT

Quality Score: 79/100 (component weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%)

Skill Content

# NASDE Benchmark Creator

Create and configure coding agent benchmarks for evaluation with `nasde`. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.

## Critical: line endings on Windows (read this first)

Benchmark scripts execute inside **Linux** sandboxes (Docker, Daytona). If `tests/test.sh`, `solution/solve.sh`, or `environment/Dockerfile` are checked out with **CRLF** line endings (the Windows git default when `core.autocrlf=true` and there is no `.gitattributes`), every trial fails immediately with:

```
bash: line 1: /tests/test.sh: cannot execute: required file not found
```

…because the kernel reads the shebang as `#!/bin/bash\r` and tries to execute a non-existent `/bin/bash\r`. The agent finishes its work, but the verifier never runs and Harbor reports `RewardFileNotFoundError`.

**Mitigation (always do this for a new benchmark; `nasde init` does it for you, but verify):**

1. The benchmark repo MUST have a `.gitattributes` file enforcing LF for shell scripts and Dockerfiles. The minimum content:

   ```gitattributes
   * text=auto eol=lf
   *.sh text eol=lf
   *.bash text eol=lf
   Dockerfile text eol=lf
   *.dockerfile text eol=lf
   docker-compose.yaml text eol=lf
   docker-compose.yml text eol=lf
   *.ps1 text eol=crlf
   *.bat text eol=crlf
   *.cmd text eol=crlf
   ```

   `nasde init` wri...
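As a rough illustration of the CRLF failure mode described above, the following sketch scans one task directory for Windows line endings in the three files the text names. The function name `find_crlf_files` and the assumption that those paths sit directly under a single task directory are hypothetical; this is not part of `nasde` itself.

```python
from pathlib import Path

# File names come from the skill text; placing them directly under one
# task directory is an assumption made for this sketch.
CHECKED_NAMES = ("tests/test.sh", "solution/solve.sh", "environment/Dockerfile")


def find_crlf_files(task_dir: Path) -> list[Path]:
    """Return the checked files under task_dir that contain CRLF line endings."""
    offenders = []
    for name in CHECKED_NAMES:
        path = task_dir / name
        # Read raw bytes so that no newline translation hides the "\r".
        if path.is_file() and b"\r\n" in path.read_bytes():
            offenders.append(path)
    return offenders
```

Calling `find_crlf_files(Path("path/to/task"))` before a run returns an empty list when the checked files are LF-only. Any path it returns would likely trigger the `required file not found` error inside the Linux sandbox.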

Details

Author: NoesisVision
Repository: NoesisVision/nasde-toolkit
Created: 4 months ago
Last Updated: today
Language: Python
License: MIT

Integrates with

Docker · Infrastructure

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

nasde-benchmark-from-public-repos

Build diverse benchmark task suites from public GitHub repositories for testing universal skills. Use this skill when the user wants to:

- Create a benchmark that spans multiple public repositories and languages
- Test a universal skill (refactoring, test writing, code review, etc.) across diverse codebases
- Curate a representative set of repos and tasks for cross-codebase validation
- Build an evaluation suite for a skill that should work in any repository

Even if the user doesn't say "benchmark", this skill applies if they're building a skill meant to work everywhere and want to validate it across many different projects.

11 Updated today

NoesisVision

AI & Automation Solid

nasde-benchmark-from-history

Generate benchmark tasks from git history of the current or specified repository. Use this skill when the user wants to:

- Create benchmark tasks based on real problems their team already solved (closed PRs, past commits, resolved issues)
- Mine git history for good evaluation candidates
- Turn a commit range or set of PRs into a NASDE benchmark
- Build a regression test suite from their team's actual work

Even if the user doesn't say "benchmark", this skill applies if they're talking about turning past work into evaluation tasks, or want to test AI agents against problems they've already solved.

11 Updated today

NoesisVision

AI & Automation Solid

nasde-benchmark-runner

Run coding agent benchmarks and verify results with nasde. Use this skill when the user wants to:

- Run a benchmark (all tasks, single task, specific variant)
- Re-run assessment evaluation on existing trial results
- Check or verify results in Opik (traces, feedback scores, experiments)
- Troubleshoot a failed benchmark run
- View or compare trial results

Even if the user doesn't say "benchmark", this skill applies if they're talking about running evaluations, checking scores, or analyzing agent performance. After every run that uses --with-opik, ALWAYS verify results via the Opik REST API; don't wait for the user to ask.

11 Updated today

NoesisVision