benchmark-sandbox

Run vercel-plugin eval scenarios in Vercel Sandboxes instead of local WezTerm panels. Provisions ephemeral microVMs with Claude Code + plugin pre-installed, runs benchmark prompts, extracts hook artifacts, and produces coverage reports.
build-with-dhiraj/ai-workflow-framework-portability-kit · ★ 2 · AI & Automation · score 75
Install: claude install-skill build-with-dhiraj/ai-workflow-framework-portability-kit
# Benchmark Sandbox: Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes (ephemeral Firecracker microVMs with node24). Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a **3-phase eval pipeline**:

- **Phase 1 (BUILD)**: Claude Code builds the app with `--dangerously-skip-permissions --debug`
- **Phase 2 (VERIFY)**: A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout)
- **Phase 3 (DEPLOY)**: A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.

Skills are tracked across **all 3 phases**: each phase may trigger additional skill injections as new files/patterns are created.

After each phase, a **haiku structured scoring step** (`claude -p --json-schema --model haiku`) evaluates the results as structured JSON.

## Proven Working Script

Use `run-eval.ts`, the proven eval runner:

```bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended; see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json

# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive