NoesisVision

Organization

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

11 indexed · 0 Featured · 10 stars · avg score 80
Prolific

Indexed Skills (11)

AI & Automation Listed

tactical-ddd

Design, refactor, analyze, and review code by applying the principles and patterns of tactical domain-driven design. Triggers on: domain modeling, aggregate design, 'entity', 'value object', 'repository', 'bounded context', 'domain event', 'domain service', code touching domain/ directories, rich domain model discussions.

10 Updated yesterday
NoesisVision
Testing & QA Listed

python-testing

Python testing strategies using pytest, TDD methodology, fixtures, mocking, parametrization, and coverage requirements.

10 Updated yesterday
NoesisVision
AI & Automation Listed

refactor

Surgical code refactoring to improve maintainability without changing behavior. Covers extracting functions, renaming variables, breaking down god functions, improving type safety, eliminating code smells, and applying design patterns. Less drastic than repo-rebuilder; use for gradual improvements.

10 Updated yesterday
NoesisVision
AI & Automation Listed

nasde-benchmark-from-history

Generate benchmark tasks from git history of the current or specified repository. Use this skill when the user wants to:
- Create benchmark tasks based on real problems their team already solved (closed PRs, past commits, resolved issues)
- Mine git history for good evaluation candidates
- Turn a commit range or set of PRs into a NASDE benchmark
- Build a regression test suite from their team's actual work

Even if the user doesn't say "benchmark", this skill applies when they're talking about turning past work into evaluation tasks, or want to test AI agents against problems they've already solved.

10 Updated yesterday
NoesisVision
AI & Automation Listed

nasde-benchmark-from-public-repos

Build diverse benchmark task suites from public GitHub repositories for testing universal skills. Use this skill when the user wants to:
- Create a benchmark that spans multiple public repositories and languages
- Test a universal skill (refactoring, test writing, code review, etc.) across diverse codebases
- Curate a representative set of repos and tasks for cross-codebase validation
- Build an evaluation suite for a skill that should work in any repository

Even if the user doesn't say "benchmark", this skill applies when they're building a skill meant to work everywhere and want to validate it across many different projects.

10 Updated yesterday
NoesisVision
AI & Automation Listed

nasde-dev

Internal skill for developing and maintaining nasde-toolkit itself. Use this skill when:
- Making changes to nasde-toolkit source code (CLI, runner, evaluator, config, agents)
- Refactoring or adding features to the toolkit
- Fixing bugs in the evaluation pipeline
- Updating dependencies or integration points (Harbor, Opik, `claude` / `codex` CLI subprocess backends)

This skill defines the verification protocol that must be followed after any significant change.

10 Updated yesterday
NoesisVision
Code & Development Listed

code-review

Use when reviewing AI-generated code for architectural quality, design patterns, and engineering practices.

10 Updated yesterday
NoesisVision
AI & Automation Listed

python-best-practices

Provides Python patterns for type-first development with dataclasses, discriminated unions, NewType, and Protocol. Must use when reading or writing Python files.

10 Updated yesterday
NoesisVision
Code & Development Listed

nasde-benchmark-calibration

Calibrate assessment rubrics by reviewing agent work in GitHub/GitLab PRs and feeding human comments back into the rubric. Use this skill when the user wants to:
- Calibrate, tune, or sanity-check assessment criteria / dimensions of a benchmark
- Review trial diffs alongside the LLM-as-a-Judge scores in a PR/MR
- Investigate why judge scores feel off, too harsh, too lenient, or misaligned with how a human would grade the code
- Pull review comments back from PRs/MRs and turn them into concrete rubric edits

Even if the user doesn't say "calibrate", this skill applies when they're worried the LLM judge's scores diverge from human judgment, or want to align scores with a real developer's opinion before freezing a benchmark.

10 Updated yesterday
NoesisVision
AI & Automation Listed

nasde-benchmark-creator

Create coding agent benchmarks for evaluation with nasde. Use this skill when the user wants to:
- Create a new benchmark project (set of tasks for evaluating coding agents)
- Add tasks to an existing benchmark
- Create or modify agent variants (configurations that control agent behavior)
- Set up assessment dimensions and scoring criteria
- Verify that a new benchmark's Docker environment and tests work

Even if the user doesn't say "benchmark", this skill applies when they're talking about creating coding challenges for AI agents or setting up evaluation criteria.

10 Updated yesterday
NoesisVision
AI & Automation Listed

nasde-benchmark-runner

Run coding agent benchmarks and verify results with nasde. Use this skill when the user wants to:
- Run a benchmark (all tasks, single task, specific variant)
- Re-run assessment evaluation on existing trial results
- Check or verify results in Opik (traces, feedback scores, experiments)
- Troubleshoot a failed benchmark run
- View or compare trial results

Even if the user doesn't say "benchmark", this skill applies when they're talking about running evaluations, checking scores, or analyzing agent performance. After every run that uses --with-opik, ALWAYS verify results via the Opik REST API; don't wait for the user to ask.

10 Updated yesterday
NoesisVision

Bio shown is the top-scored skill's repo description as a fallback — real GitHub bios land in a future update.