Install: claude install-skill build-with-dhiraj/ai-workflow-framework-portability-kit
# Benchmark Sandbox — Remote Eval via Vercel Sandboxes
Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs with the `node24` runtime. Each sandbox gets a fresh install of Claude Code, the Vercel CLI, and agent-browser, has the local vercel-plugin uploaded, and runs a **3-phase eval pipeline**:
- **Phase 1 (BUILD)**: Claude Code builds the app with `--dangerously-skip-permissions --debug`
- **Phase 2 (VERIFY)**: A follow-up Claude Code session uses `agent-browser` to walk through the user stories, fixing issues until all of them pass (20-minute timeout)
- **Phase 3 (DEPLOY)**: A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
Skills are tracked across **all 3 phases**: each phase may trigger additional skill injections as new files and patterns are created. After each phase, a **haiku structured scoring step** (`claude -p --json-schema --model haiku`) evaluates the results and returns them as structured JSON.
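The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `run-eval.ts`: the names `PhaseResult`, `PipelineReport`, `runPipeline`, `runPhase`, and `scorePhase` are hypothetical, and stopping after a failed phase is an assumption. The two callbacks stand in for the Claude Code session and the haiku scoring call, so the sketch runs without a sandbox.

```typescript
// Hypothetical sketch: these names are illustrative and are not taken from run-eval.ts.
type PhaseName = "BUILD" | "VERIFY" | "DEPLOY";

interface PhaseResult {
  phase: PhaseName;
  ok: boolean;
  skillsInjected: string[]; // skills triggered while this phase ran
}

interface PipelineReport {
  results: PhaseResult[];
  scores: { phase: PhaseName; score: unknown }[]; // one structured score per phase run
  skills: string[]; // union of skills seen across all phases run
}

const PHASES: PhaseName[] = ["BUILD", "VERIFY", "DEPLOY"];

async function runPipeline(
  runPhase: (phase: PhaseName) => Promise<PhaseResult>,
  scorePhase: (result: PhaseResult) => Promise<unknown>,
): Promise<PipelineReport> {
  const report: PipelineReport = { results: [], scores: [], skills: [] };
  const seen = new Set<string>();

  for (const phase of PHASES) {
    const result = await runPhase(phase);
    report.results.push(result);

    // Accumulate skills across phases, keeping first-seen order.
    for (const skill of result.skillsInjected) seen.add(skill);

    // Score after every phase, whether it passed or not.
    report.scores.push({ phase, score: await scorePhase(result) });

    // Assumption: a failed phase stops the pipeline.
    if (!result.ok) break;
  }

  report.skills = Array.from(seen);
  return report;
}
```

Passing the phase runner and scorer in as functions keeps the sequencing logic testable with stubs, independent of how a sandbox or the scoring model is actually invoked.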
## Proven Working Script
Use `run-eval.ts` — the proven eval runner:
```bash
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts
# With dynamic scenarios from a JSON file (recommended — see "Dynamic Scenarios" below)
bun run .claude/skills/benchmark-sandbox/run-eval.ts --scenarios-file /tmp/my-scenarios.json
# Keep sandboxes alive overnight with public URLs
bun run .claude/skills/benchmark-sandbox/run-eval.ts --keep-alive