Quickstart

This quickstart covers installing the CLI, building and running a benchmark scenario, viewing the results, and opening the audit UI. After completing it, you can replicate our results, create your own scenarios, and contribute data to the benchmark.

Prerequisites

  • Node.js installed (required for npx in step 3 and pnpm install in step 5)
  • Docker installed and running
  • An API key for the agent you want to test:
    • Claude Code: export ANTHROPIC_API_KEY=<key>
    • Codex: export OPENAI_API_KEY=<key>
    • Cursor: export CURSOR_API_KEY=<key>
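
The prerequisites above can be checked before you start. The following is a minimal sketch, not part of the dec-bench CLI: the helper names check_tool and preflight are hypothetical, and it assumes that a failing docker info means the Docker daemon is not reachable.

```shell
# Hypothetical preflight helpers; not shipped with dec-bench.

# Succeed if the named command is on PATH.
check_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report each missing prerequisite; return non-zero if any check fails.
preflight() {
  status=0
  for tool in node npx docker; do
    if ! check_tool "$tool"; then
      echo "missing: $tool" >&2
      status=1
    fi
  done
  # docker info exits non-zero when the client cannot reach the daemon.
  if check_tool docker && ! docker info >/dev/null 2>&1; then
    echo "docker is installed but the daemon is not running" >&2
    status=1
  fi
  return "$status"
}

if preflight; then
  echo "prerequisites look OK"
else
  echo "fix the items reported above, then retry"
fi
```

The sketch only checks tools; API keys are agent-specific and still need to be exported as listed above.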

1. Clone the repository

The CLI requires a local checkout of the repository for scenario definitions and build scripts.

Clone and enter the repo
git clone https://github.com/514-labs/agent-evals.git
cd agent-evals

2. Install the CLI

Install dec-bench
curl -fsSL https://decbench.ai/install.sh | sh

3. Install agent skills

DEC Bench ships skills that let your coding agent handle building, running, and creating scenarios. Install them for whichever agents you use:

Skill                       What it does
dec-bench-quickstart        Validates your setup and runs your first scenario
dec-bench-run               Runs scenarios, compares agents, and handles complex multi-scenario test requests
dec-bench-create-scenario   Walks you through creating a new scenario from scratch

Install skills
npx skills add . --skill dec-bench-quickstart -a claude-code -a cursor -a codex
npx skills add . --skill dec-bench-run -a claude-code -a cursor -a codex
npx skills add . --skill dec-bench-create-scenario -a claude-code -a cursor -a codex
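
The three install commands differ only in the skill name, so they can be written as a loop. This is a sketch under stated assumptions: install_skills and the DRY_RUN variable are hypothetical conveniences of this example, not options of the skills CLI. The npx arguments are copied unchanged from the commands above.

```shell
# Install every DEC Bench skill for Claude Code, Cursor, and Codex.
# Set DRY_RUN=1 to print the commands instead of running them
# (DRY_RUN is a convention of this sketch, not a skills CLI option).
install_skills() {
  for skill in dec-bench-quickstart dec-bench-run dec-bench-create-scenario; do
    ${DRY_RUN:+echo} npx skills add . --skill "$skill" -a claude-code -a cursor -a codex
  done
}

# Preview the three commands without installing anything.
DRY_RUN=1 install_skills
```

Calling install_skills without DRY_RUN runs the real commands. Drop any -a flag for an agent you do not use.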

Once installed, open your agent in the repo directory (e.g. claude for Claude Code, or open the folder in Cursor) and try:

Example prompt
Help me get started with DEC Bench — build and run
the foo-bar-csv-ingest scenario with Claude Code

Your agent will walk through steps 4–5 below, including setting up the API key, building the image, running the benchmark, and showing you results. If you prefer to do it manually, continue below.

4. Build and run the benchmark

Export the API key for your agent, then build and run. You can do this via the CLI or by asking your agent:

CLI
export ANTHROPIC_API_KEY=<your-key>
dec-bench build --scenario foo-bar-csv-ingest
dec-bench run --scenario foo-bar-csv-ingest
Or ask your agent
Run the foo-bar-csv-ingest scenario with Claude Code

The CLI starts a container, launches the agent against the scenario prompt, runs the validation gates, and writes result artifacts. When the run finishes, it prints a summary that includes the Run ID. Copy the Run ID; you need it in step 5.

Run summary
Run ID: foo-bar-csv-ingest-claude-code-...-1776118371671
Gate/score: highest gate 0 | normalized score 0.100
Result file: results/foo-bar-csv-ingest-...-1776118371671.json
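
If you script the run, the Run ID can be pulled out of the summary instead of copied by hand. The helper below is a hypothetical sketch: extract_run_id is not a dec-bench command, and it assumes the summary keeps the "Run ID: " line prefix shown above.

```shell
# Print the first Run ID found in a run summary read from stdin.
extract_run_id() {
  sed -n 's/^Run ID: //p' | head -n 1
}

# Demonstration with the sample summary shown above.
printf '%s\n' \
  'Run ID: foo-bar-csv-ingest-claude-code-...-1776118371671' \
  'Gate/score: highest gate 0 | normalized score 0.100' \
  | extract_run_id
```

With a real run, piping dec-bench run --scenario foo-bar-csv-ingest into extract_run_id would capture the ID, provided the summary is written to standard output, which this page does not confirm.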

5. View results

You can view results in two ways: in your terminal or in a browser. The browser view makes agent traces, assertion results, and run metadata much easier to read at a glance.

In your terminal

Print gate scores and artifact paths directly:

Inspect the run
dec-bench results --run-id <run-id>

This prints the run metadata, gate scores, and paths to all artifacts:

Selected run
Run ID: foo-bar-csv-ingest-claude-code-...
Scenario: foo-bar-csv-ingest
Harness: base-rt
Agent: claude-code
Model: claude-sonnet-4-20250514
Version: v0.1.0
Highest gate: 0
Normalized score: 0.1000
Result file: results/foo-bar-csv-ingest-...json
Artifacts:
  stdout: results/foo-bar-csv-ingest-...stdout
  trace: results/foo-bar-csv-ingest-...trace.json
  agent raw: results/foo-bar-csv-ingest-...agent-raw.json
  ...
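
Because the output is a list of "Field: value" lines, a script can test a field against a threshold, for example to fail a CI job on a low score. This is a sketch only: score_at_least is a hypothetical helper, and it assumes the "Normalized score" label and the standard-output destination stay as shown above.

```shell
# Succeed only if the "Normalized score" field read from stdin is at
# least the minimum given as $1. Fails if the field is absent.
score_at_least() {
  awk -v min="$1" -F': *' '
    $1 == "Normalized score" { found = 1; ok = ($2 + 0 >= min + 0) }
    END { exit !(found && ok) }
  '
}

# Demonstration with the sample values shown above (score 0.1000).
sample='Highest gate: 0
Normalized score: 0.1000'
if printf '%s\n' "$sample" | score_at_least 0.05; then
  echo "score meets the 0.05 threshold"
fi
```

In practice the input would come from dec-bench results --run-id <run-id>, and the threshold is yours to choose.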

In a browser

The audit UI is a local Next.js app. Install its dependencies once before opening it for the first time:

Install web dependencies (first time only)
pnpm install

If pnpm is not installed, run npm install -g pnpm first (requires Node.js).

Open the audit UI
dec-bench audit open \
  --scenario foo-bar-csv-ingest \
  --run-id <run-id>

Next steps

  • Run more scenarios: dec-bench list shows all available scenarios, or ask your agent to run one for you
  • Create your own scenarios: see Add a Scenario
  • Run the full matrix: dec-bench run --matrix --parallel auto runs all agent/harness/persona combinations

Troubleshooting

Docker is not running

If dec-bench build or dec-bench run fails with a Docker connection error, start Docker Desktop (or the Docker daemon) and retry.

Missing API key

If the API key is not set, dec-bench run exits immediately:

Error: Missing required API key for agent 'claude-code':

  - ANTHROPIC_API_KEY

Set it before running:

    export ANTHROPIC_API_KEY=<your-key>
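
To catch a missing key before starting a run, a wrapper script can check the variable for the chosen agent. The sketch below is hypothetical: required_key_for and check_agent_key are not dec-bench commands. The claude-code identifier comes from the error message above; codex and cursor are assumed from the -a flags in step 3 and may not match what dec-bench expects.

```shell
# Map an agent name to the API key variable listed in the prerequisites.
# The codex and cursor identifiers are assumptions of this sketch.
required_key_for() {
  case "$1" in
    claude-code) echo ANTHROPIC_API_KEY ;;
    codex)       echo OPENAI_API_KEY ;;
    cursor)      echo CURSOR_API_KEY ;;
    *)           return 1 ;;
  esac
}

# Succeed if the key for the given agent is set and non-empty.
check_agent_key() {
  var=$(required_key_for "$1") || { echo "unknown agent: $1" >&2; return 2; }
  # Indirect lookup of the variable whose name is stored in $var.
  eval "value=\${$var:-}"
  if [ -z "$value" ]; then
    echo "$var is not set; export it before running dec-bench run" >&2
    return 1
  fi
}

if check_agent_key claude-code; then
  echo "API key found for claude-code"
fi
```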

Port conflict on audit open

If port 3000 is already in use, dec-bench audit open fails. Free the port, or pass a different one with --port:

dec-bench audit open --scenario foo-bar-csv-ingest --run-id <run-id> --port 3001