Quickstart
Install the CLI, build and run a benchmark scenario, view results, and open the audit UI. From there you can replicate our results, create your own scenarios, and contribute data to the benchmark.
Prerequisites
- Node.js installed (required for npx in step 3 and pnpm install in step 5)
- Docker installed and running
- An API key for the agent you want to test:
  - Claude Code: export ANTHROPIC_API_KEY=<key>
  - Codex: export OPENAI_API_KEY=<key>
  - Cursor: export CURSOR_API_KEY=<key>
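The prerequisites above can be checked before you start. The sketch below is a set of hypothetical helpers, not part of the DEC Bench CLI: it reports tools missing from PATH and checks that the API key variable for a given agent is set. The agent identifiers (claude-code, codex, cursor) are taken from the -a flags used in step 3; the function names are illustrative.

```shell
# Hypothetical preflight helpers; not part of the dec-bench CLI.

# Print each named tool that is not on PATH; succeed only if none are missing.
missing_tools() {
  status=0
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "$tool"; status=1; }
  done
  return "$status"
}

# Print the API key variable name for an agent identifier.
key_var_for_agent() {
  case "$1" in
    claude-code) echo ANTHROPIC_API_KEY ;;
    codex)       echo OPENAI_API_KEY ;;
    cursor)      echo CURSOR_API_KEY ;;
    *)           return 1 ;;
  esac
}

# Succeed only if the agent is known and its key variable is non-empty.
agent_key_is_set() {
  var=$(key_var_for_agent "$1") || { echo "unknown agent: $1" >&2; return 2; }
  eval "value=\${$var:-}"
  if [ -z "$value" ]; then
    echo "missing $var for agent '$1'" >&2
    return 1
  fi
}
```

For example, running missing_tools node docker followed by agent_key_is_set claude-code should surface most setup problems before step 4.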
1. Clone the repository
The CLI requires a local checkout of the repository for scenario definitions and build scripts.
git clone https://github.com/514-labs/agent-evals.git
cd agent-evals
2. Install the CLI
curl -fsSL https://decbench.ai/install.sh | sh
3. Install agent skills
DEC Bench ships skills that let your coding agent handle building, running, and creating scenarios. Install them for whichever agents you use:
| Skill | What it does |
|---|---|
| dec-bench-quickstart | Validates your setup and runs your first scenario |
| dec-bench-run | Runs scenarios, compares agents, and handles complex multi-scenario test requests |
| dec-bench-create-scenario | Walks you through creating a new scenario from scratch |
npx skills add . --skill dec-bench-quickstart -a claude-code -a cursor -a codex
npx skills add . --skill dec-bench-run -a claude-code -a cursor -a codex
npx skills add . --skill dec-bench-create-scenario -a claude-code -a cursor -a codex
Once installed, open your agent in the repo directory (e.g. claude for Claude Code, or open the folder in Cursor) and try:
Help me get started with DEC Bench: build and run the foo-bar-csv-ingest scenario with Claude Code
Your agent will walk through steps 4–5 below, including setting up the API key, building the image, running the benchmark, and showing you results. If you prefer to do it manually, continue below.
4. Build and run the benchmark
Export the API key for your agent, then build and run. You can do this via the CLI or by asking your agent:
export ANTHROPIC_API_KEY=<your-key>
dec-bench build --scenario foo-bar-csv-ingest
dec-bench run --scenario foo-bar-csv-ingest
Or ask your agent:
Run the foo-bar-csv-ingest scenario with Claude Code
The CLI starts a container, launches the agent against the scenario prompt, runs the validation gates, and writes result artifacts. At the end it prints a run summary that includes the Run ID. Copy the Run ID for the next steps.
Run summary
Run ID: foo-bar-csv-ingest-claude-code-...-1776118371671
Gate/score: highest gate 0 | normalized score 0.100
Result file: results/foo-bar-csv-ingest-...-1776118371671.json
5. View results
You can view results two ways: in your terminal or in a browser. The browser view makes agent traces, assertion results, and run metadata much easier to understand at a glance.
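Both views take the Run ID printed in the run summary. As a rough sketch, assuming the summary keeps the "Run ID: <value>" line format shown in step 4, the ID can be pulled out of captured output with sed. The function name is illustrative, and the sample reuses the elided ID from the summary above as-is.

```shell
# Extract the value of the first "Run ID:" line from run output on stdin.
# Assumes the "Run ID: <value>" format shown in the run summary above.
extract_run_id() {
  sed -n 's/^[[:space:]]*Run ID:[[:space:]]*//p' | head -n 1
}

# Example with a summary shaped like the one in step 4.
sample='Run summary
Run ID: foo-bar-csv-ingest-claude-code-...-1776118371671
Gate/score: highest gate 0 | normalized score 0.100'

printf '%s\n' "$sample" | extract_run_id
```

If the CLI writes its summary to stdout (an assumption, not confirmed here), you could save the run output to a file of your choosing and set RUN_ID=$(extract_run_id < that-file) before calling dec-bench results.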
In your terminal
Print gate scores and artifact paths directly:
dec-bench results --run-id <run-id>
This prints the run metadata, gate scores, and paths to all artifacts:
Selected run
Run ID: foo-bar-csv-ingest-claude-code-...
Scenario: foo-bar-csv-ingest
Harness: base-rt
Agent: claude-code
Model: claude-sonnet-4-20250514
Version: v0.1.0
Highest gate: 0
Normalized score: 0.1000
Result file: results/foo-bar-csv-ingest-...json
Artifacts:
stdout: results/foo-bar-csv-ingest-...stdout
trace: results/foo-bar-csv-ingest-...trace.json
agent raw: results/foo-bar-csv-ingest-...agent-raw.json
...
In a browser
The audit UI is a local Next.js app. Install its dependencies once before opening it for the first time:
pnpm install
If pnpm is not installed, run npm install -g pnpm first (requires Node.js). Then open the audit UI for your run:
dec-bench audit open \
--scenario foo-bar-csv-ingest \
--run-id <run-id>
Next steps
- Run more scenarios: dec-bench list shows all available scenarios, or ask your agent to run one for you
- Create your own scenarios: see Add a Scenario
- Run the full matrix: dec-bench run --matrix --parallel auto runs all agent/harness/persona combinations
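When you run many scenarios, a script may need to fail if a run scores below some threshold. The sketch below assumes the "Normalized score:" line format shown in the dec-bench results output in step 5. The function names and the 0.5 threshold are illustrative, not CLI features.

```shell
# Print the number on the "Normalized score:" line of results output (stdin).
normalized_score() {
  awk -F': *' '/^[[:space:]]*Normalized score:/ { print $2; exit }'
}

# Succeed if score (arg 1) is at least threshold (arg 2); awk handles decimals.
score_at_least() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Example with lines shaped like the results output in step 5.
sample='Highest gate: 0
Normalized score: 0.1000'

score=$(printf '%s\n' "$sample" | normalized_score)
if score_at_least "$score" 0.5; then
  echo "pass"
else
  echo "below threshold: $score"
fi
```

With the sample above this prints "below threshold: 0.1000". In practice you would pipe dec-bench results --run-id <run-id> into normalized_score instead of the sample text.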
Troubleshooting
Docker is not running
If dec-bench build or dec-bench run fails with a Docker connection error, start Docker Desktop (or the Docker daemon) and retry.
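To check for this before building, one option is to probe the daemon with docker info, which exits non-zero when it cannot reach the daemon. The helper names below are hypothetical and not part of the DEC Bench CLI.

```shell
# Succeed only if the docker CLI is on PATH and the daemon answers.
docker_ready() {
  command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1
}

# Poll until Docker responds or the attempts run out.
# Args: attempts (default 30), delay in seconds between attempts (default 2).
wait_for_docker() {
  attempts=${1:-30}
  delay=${2:-2}
  while [ "$attempts" -gt 0 ]; do
    docker_ready && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_docker 10 3 waits up to roughly 30 seconds after you start Docker Desktop, and can guard a following dec-bench build.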
Missing API key
If the API key is not set, dec-bench run exits immediately:
Error: Missing required API key for agent 'claude-code':
- ANTHROPIC_API_KEY
Set it before running:
export ANTHROPIC_API_KEY=<your-key>
Port conflict on audit open
If port 3000 is already in use, the audit command will fail. Free the port or specify an alternative:
dec-bench audit open --scenario foo-bar-csv-ingest --run-id <run-id> --port 3001
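To pick the alternative port automatically, a small probe can look for the first local port with no listener. This is a sketch that requires bash, because it relies on bash's /dev/tcp redirection. It only tests 127.0.0.1, so a listener bound to another address could be missed. The function names and the port range are illustrative.

```shell
# Requires bash: /dev/tcp is a bash feature, not POSIX sh.

# Succeed if something accepts TCP connections on 127.0.0.1:<port>.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Print the first port in [start, end] that nothing is listening on.
first_free_port() {
  port=$1
  while [ "$port" -le "$2" ]; do
    port_in_use "$port" || { echo "$port"; return 0; }
    port=$((port + 1))
  done
  return 1
}
```

The result can then be passed to the --port flag shown above, for example --port "$(first_free_port 3000 3010)".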