Running Evals

Pass your keys, pick an image, read the JSON.

Quick Start

Run a single eval
docker run \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -e CURSOR_API_KEY="$CURSOR_API_KEY" \
  ghcr.io/514-labs/dec-bench:foo-bar-csv-ingest.base-rt.claude-code.sonnet-4.v0.1.0

The image tag tells you exactly what runs: {scenario}.{harness}.{agent}.{model}.{version}. Results print to stdout as JSON.
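
Because the version field (and possibly the model field) contains dots of its own, composing a tag from its parts is less error-prone than splitting one on dots. A minimal sketch of that composition follows; the helper name dec_bench_image is hypothetical and not part of the dec-bench CLI, and the registry path is copied from the example above.

```shell
# Compose a dec-bench image reference from its five tag fields.
# dec_bench_image is a hypothetical helper, not part of the dec-bench CLI.
dec_bench_image() {
  # $1=scenario  $2=harness  $3=agent  $4=model  $5=version
  printf 'ghcr.io/514-labs/dec-bench:%s.%s.%s.%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Prints the image reference used in the Quick Start example.
dec_bench_image foo-bar-csv-ingest base-rt claude-code sonnet-4 v0.1.0
```

The result can be passed straight to docker run, for example as "$(dec_bench_image foo-bar-csv-ingest base-rt claude-code sonnet-4 v0.1.0)".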

v0.1 Scenarios

Thirty-six scenarios are available on the Foo Bar synthetic domain (13 tier-1, 18 tier-2, 5 tier-3). See Foo Bar Domain for the full list.

Example image tags:

foo-bar-csv-ingest.base-rt.claude-code.sonnet-4.v0.1.0
foo-bar-csv-ingest.base-rt.codex.gpt-5-codex.v0.1.0
foo-bar-csv-ingest.base-rt.cursor.auto.v0.1.0
foo-bar-time-grain-rollups.classic-de.claude-code.sonnet-4.v0.1.0
foo-bar-cross-system-reconciliation.olap-for-swe.claude-code.sonnet-4.v0.1.0

API Keys

Each image validates the key it needs at startup. Pass all your keys -- the container uses what it needs and ignores the rest.

Provider                 Variable           Agents
Anthropic                ANTHROPIC_API_KEY  Claude Code
OpenAI                   OPENAI_API_KEY     Codex
OpenAI (Codex override)  CODEX_API_KEY      Codex
Cursor                   CURSOR_API_KEY     Cursor
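
Since the container ignores keys it does not need, one way to avoid retyping flags is to forward whichever of the variables above happen to be set. The sketch below is an illustration, not part of dec-bench: key_flags is a hypothetical helper, and it relies on the standard docker run behavior that -e NAME with no value copies NAME from the calling environment (so the variables must be exported).

```shell
# Print a "-e NAME" flag for each provider key that is set and non-empty.
# key_flags is a hypothetical helper, not part of the dec-bench CLI.
key_flags() {
  for name in ANTHROPIC_API_KEY OPENAI_API_KEY CODEX_API_KEY CURSOR_API_KEY; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      printf '%s %s ' -e "$name"
    fi
  done
  unset value
}

# Show which flags would be passed for the current environment.
key_flags; echo
```

Used as docker run $(key_flags) followed by the image, with the substitution left unquoted so the flags split into separate arguments. Only variable names are printed, so key values never appear in the command line or shell history.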

Using the CLI

The dec-bench CLI wraps Docker image building and execution:

Run via CLI
dec-bench run \
  --scenario foo-bar-csv-ingest \
  --harness base-rt \
  --agent claude-code \
  --model claude-sonnet-4-20250514

Other supported agent IDs:

  • codex (for OpenAI Codex CLI)
  • cursor (for Cursor Agent CLI)

List available scenarios:

List foo-bar scenarios
dec-bench list --domain foo-bar

View results from completed runs:

View results
dec-bench results --scenario foo-bar-csv-ingest

Build Matrix

Export your keys once, then loop over the scenario and harness combinations you want to run:

Run the full matrix
export ANTHROPIC_API_KEY=sk-ant-...

for scenario in foo-bar-csv-ingest foo-bar-slow-queries foo-bar-table-layout; do
  for harness in base-rt classic-de olap-for-swe; do
    docker run \
      -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
      "ghcr.io/514-labs/dec-bench:${scenario}.${harness}.claude-code.sonnet-4.v0.1.0" \
      >> results.jsonl
  done
done
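
Before spending API credits on a loop like the one above, it can help to print the image list first as a dry run. The sketch below only prints references and never calls Docker; matrix_images is a hypothetical helper, and it assumes every scenario and harness combination is published under the claude-code.sonnet-4.v0.1.0 suffix, which may not hold for all pairs.

```shell
# List every image in a scenario x harness matrix without running anything.
# matrix_images is a hypothetical helper; both arguments are space-separated lists.
matrix_images() {
  for scenario in $1; do
    for harness in $2; do
      printf 'ghcr.io/514-labs/dec-bench:%s.%s.claude-code.sonnet-4.v0.1.0\n' \
        "$scenario" "$harness"
    done
  done
}

# Prints one image reference per line for the 3 x 3 matrix above.
matrix_images "foo-bar-csv-ingest foo-bar-slow-queries foo-bar-table-layout" \
              "base-rt classic-de olap-for-swe"
```

The output can be reviewed by eye or piped to xargs -n1 docker pull to confirm each image exists before the real run.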

Run a Haiku 4.5 variant:

Run with Haiku 4.5
dec-bench run \
  --scenario foo-bar-csv-ingest \
  --harness base-rt \
  --agent claude-code \
  --model haiku-4.5

Custom Agents

Mount your agent and point to it:

Run a custom agent
docker run \
  -e GOOGLE_API_KEY=$GOOGLE_API_KEY \
  -e CUSTOM_AGENT_CMD="/workspace/my-agent" \
  -v "$(pwd)/my-agent:/workspace/my-agent" \
  ghcr.io/514-labs/dec-bench:foo-bar-csv-ingest.base-rt.custom.v0.1.0

Custom agents declare their own key requirements. If running in restricted network mode, provide an iptables.sh to allowlist your provider's API endpoint.