Quickstart

Install the CLI, build and run a benchmark scenario, view results, and open the audit UI. This will allow you to replicate our results, create your own scenarios, and contribute data to the benchmark.

Prerequisites

- Node.js installed (required for npx in step 3 and pnpm install in step 5)
- Docker installed and running
- An API key for the agent you want to test:
  - Claude Code: export ANTHROPIC_API_KEY=<key>
  - Codex: export OPENAI_API_KEY=<key>
  - Cursor: export CURSOR_API_KEY=<key>
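
The prerequisites above can be checked before you start. The following is a minimal sketch, not part of DEC Bench: the function names (have_cmd, require_env, preflight) are hypothetical helpers, and it assumes that `docker info` failing means the Docker daemon is not running.

```shell
#!/bin/sh
# Preflight sketch: report missing prerequisites before running the quickstart.
# All function names here are illustrative, not part of the dec-bench CLI.

# have_cmd NAME: succeeds if NAME is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# require_env NAME: succeeds if the variable NAME is set and non-empty.
require_env() {
  eval "[ -n \"\${$1:-}\" ]"
}

# preflight KEY_VAR: prints one line per missing prerequisite and
# returns non-zero if anything is missing.
preflight() {
  missing=0
  for cmd in node npx docker; do
    have_cmd "$cmd" || { echo "missing command: $cmd"; missing=1; }
  done
  # Assumption: `docker info` fails when the daemon is not reachable.
  if have_cmd docker && ! docker info >/dev/null 2>&1; then
    echo "docker is installed but the daemon is not running"
    missing=1
  fi
  require_env "$1" || { echo "missing environment variable: $1"; missing=1; }
  return "$missing"
}
```

For Claude Code you would call `preflight ANTHROPIC_API_KEY`; for Codex or Cursor, pass OPENAI_API_KEY or CURSOR_API_KEY instead.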

1. Clone the repository

The CLI requires a local checkout of the repository for scenario definitions and build scripts.

Clone and enter the repo

git clone https://github.com/514-labs/agent-evals.git
cd agent-evals

2. Install the CLI

Install dec-bench

curl -fsSL https://decbench.ai/install.sh | sh

3. Install agent skills

DEC Bench ships skills that let your coding agent handle building, running, and creating scenarios. Install them for whichever agents you use:

Skill	What it does
`dec-bench-quickstart`	Validates your setup and runs your first scenario
`dec-bench-run`	Runs scenarios, compares agents, and handles complex multi-scenario test requests
`dec-bench-create-scenario`	Walks you through creating a new scenario from scratch

Install skills

npx skills add . --skill dec-bench-quickstart -a claude-code -a cursor -a codex
npx skills add . --skill dec-bench-run -a claude-code -a cursor -a codex
npx skills add . --skill dec-bench-create-scenario -a claude-code -a cursor -a codex
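
If you only use some of the agents, the three commands above can be wrapped in a loop. This is a sketch under assumptions: install_skills and the DRY_RUN variable are hypothetical conveniences defined here, not features of the skills tool, and the npx invocation is copied from the commands above.

```shell
#!/bin/sh
# Sketch: install every DEC Bench skill for a chosen set of agents.
# DRY_RUN=1 (a variable local to this script) prints the commands instead of running them.

SKILLS="dec-bench-quickstart dec-bench-run dec-bench-create-scenario"

# install_skills AGENT...: runs `npx skills add` once per skill,
# passing one `-a AGENT` pair per agent given.
install_skills() {
  agent_flags=""
  for agent in "$@"; do
    agent_flags="$agent_flags -a $agent"
  done
  for skill in $SKILLS; do
    # $agent_flags is intentionally unquoted so it splits into separate arguments.
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo npx skills add . --skill "$skill" $agent_flags
    else
      npx skills add . --skill "$skill" $agent_flags || return 1
    fi
  done
}
```

Run it from the repository root, for example `install_skills claude-code cursor`. Setting DRY_RUN=1 first lets you review the commands before anything is installed.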

Once installed, open your agent in the repo directory (e.g. claude for Claude Code, or open the folder in Cursor) and try:

Example prompt

Help me get started with DEC Bench — build and run
the foo-bar-csv-ingest scenario with Claude Code

Your agent will walk through steps 4–5 below, including setting up the API key, building the image, running the benchmark, and showing you results. If you prefer to do it manually, continue below.

4. Build and run the benchmark

Export the API key for your agent, then build and run. You can do this via the CLI or by asking your agent:

CLI

export ANTHROPIC_API_KEY=<your-key>
dec-bench build --scenario foo-bar-csv-ingest
dec-bench run --scenario foo-bar-csv-ingest

Or ask your agent

Run the foo-bar-csv-ingest scenario with Claude Code

The CLI starts a container, launches the agent against the scenario prompt, runs the validation gates, and writes result artifacts. At the end it prints a run summary that includes the Run ID. Copy the Run ID; the commands in step 5 take it as <run-id>.

Run summary
Run ID: foo-bar-csv-ingest-claude-code-...-1776118371671
Gate/score: highest gate 0 | normalized score 0.100
Result file: results/foo-bar-csv-ingest-...-1776118371671.json
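
Instead of copying the Run ID by hand, you can pull it out of the summary. This is a minimal sketch that assumes the summary is written to stdout and contains a line starting with "Run ID:" as shown above; extract_run_id is a hypothetical helper, not a CLI command.

```shell
#!/bin/sh
# extract_run_id: reads CLI output on stdin and prints the value of the
# last line that starts with "Run ID:" (nothing if there is no such line).
extract_run_id() {
  sed -n 's/^[[:space:]]*Run ID:[[:space:]]*//p' | tail -n 1
}

# Usage sketch (assumes the run summary goes to stdout):
#   dec-bench run --scenario foo-bar-csv-ingest | tee run.log
#   RUN_ID=$(extract_run_id < run.log)
#   dec-bench results --run-id "$RUN_ID"
```

Saving the output with tee keeps the full log on disk, so the Run ID can be recovered later without re-running the scenario.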

5. View results

You can view results in two ways: in your terminal or in a browser. The browser view makes agent traces, assertion results, and run metadata much easier to read at a glance.

In your terminal

Print gate scores and artifact paths directly:

Inspect the run

dec-bench results --run-id <run-id>

This prints the run metadata, gate scores, and paths to all artifacts:

Selected run
Run ID: foo-bar-csv-ingest-claude-code-...
Scenario: foo-bar-csv-ingest
Harness: base-rt
Agent: claude-code
Model: claude-sonnet-4-20250514
Version: v0.1.0
Highest gate: 0
Normalized score: 0.1000
Result file: results/foo-bar-csv-ingest-...json
Artifacts:
  stdout: results/foo-bar-csv-ingest-...stdout
  trace: results/foo-bar-csv-ingest-...trace.json
  agent raw: results/foo-bar-csv-ingest-...agent-raw.json
  ...
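
For scripting, the "Name: value" lines in this output can be read with a small filter. The sketch below assumes the output keeps the layout shown above; result_field and score_at_least are hypothetical helpers defined here, and only top-level (unindented) fields are matched.

```shell
#!/bin/sh
# result_field NAME: reads `dec-bench results` output on stdin and prints the
# value of the first unindented line that starts with "NAME:".
result_field() {
  awk -v key="$1" '
    !found && index($0, key ":") == 1 {
      val = substr($0, length(key) + 2)   # text after the colon
      sub(/^[ \t]+/, "", val)             # drop leading whitespace
      print val
      found = 1
    }'
}

# score_at_least THRESHOLD: reads the same output on stdin and succeeds if the
# "Normalized score" field is greater than or equal to THRESHOLD.
score_at_least() {
  score=$(result_field "Normalized score")
  awk -v s="$score" -v t="$1" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# Usage sketch:
#   dec-bench results --run-id "$RUN_ID" | score_at_least 0.5 || echo "below threshold"
```

A missing "Normalized score" line is treated as a score of 0, so the threshold check fails rather than passing silently.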

In a browser

The audit UI is a local Next.js app. Install its dependencies once before opening it for the first time:

Install web dependencies (first time only)

pnpm install

If pnpm is not installed, run npm install -g pnpm first (requires Node.js).

Open the audit UI

dec-bench audit open \
  --scenario foo-bar-csv-ingest \
  --run-id <run-id>

Next steps

- Run more scenarios: dec-bench list shows all available scenarios, or ask your agent to run one for you
- Create your own scenarios: see Add a Scenario
- Run the full matrix: dec-bench run --matrix --parallel auto runs all agent/harness/persona combinations

Troubleshooting

Missing API key

If the API key for the selected agent is not set, the CLI reports:

Error: Missing required API key for agent 'claude-code':

  - ANTHROPIC_API_KEY

Set it before running:

    export ANTHROPIC_API_KEY=<your-key>

Port conflict on audit open

If port 3000 is already in use, the audit command will fail. Free the port or specify an alternative:

dec-bench audit open --scenario foo-bar-csv-ingest --run-id <run-id> --port 3001
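
To pick an alternative port automatically, you can probe a small range first. This is a sketch under assumptions: it relies on a netcat (`nc`) that supports the -z flag, it only checks 127.0.0.1, and both function names are hypothetical helpers rather than CLI features.

```shell
#!/bin/sh
# port_in_use PORT: succeeds if something accepts TCP connections on 127.0.0.1:PORT.
# Assumes a netcat that supports -z; swap in lsof or ss if yours does not.
port_in_use() {
  nc -z 127.0.0.1 "$1" >/dev/null 2>&1
}

# first_free_port START END: prints the first port in [START, END] that is
# not in use, or returns non-zero if every port in the range is taken.
first_free_port() {
  port=$1
  while [ "$port" -le "$2" ]; do
    if ! port_in_use "$port"; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}

# Usage sketch (RUN_ID holds the Run ID printed at the end of the run):
#   PORT=$(first_free_port 3000 3010) &&
#     dec-bench audit open --scenario foo-bar-csv-ingest --run-id "$RUN_ID" --port "$PORT"
```

The check and the later bind are not atomic, so another process could still claim the port in between; for a local audit UI that race is usually acceptable.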