Running Evals
Pass your keys, pick an image, read the JSON.
Quick Start
docker run \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e CURSOR_API_KEY=$CURSOR_API_KEY \
  ghcr.io/514-labs/dec-bench:foo-bar-csv-ingest.base-rt.claude-code.sonnet-4.v0.1.0
The image tag tells you exactly what runs: {scenario}.{harness}.{agent}.{model}.{version}. Results print to stdout as JSON.
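The tag format above can be assembled from its five fields rather than typed by hand. The following is a minimal sketch: `image_ref` is a hypothetical helper name, not part of dec-bench, and it assumes every image lives under the `ghcr.io/514-labs/dec-bench` repository shown in the Quick Start command.

```shell
# Hypothetical helper (not part of dec-bench): build a full image reference
# from the five tag fields {scenario}.{harness}.{agent}.{model}.{version}.
image_ref() {
  # $1 scenario, $2 harness, $3 agent, $4 model, $5 version
  if [ "$#" -ne 5 ]; then
    echo "usage: image_ref SCENARIO HARNESS AGENT MODEL VERSION" >&2
    return 1
  fi
  # Assumes the registry path used in the Quick Start command.
  printf 'ghcr.io/514-labs/dec-bench:%s.%s.%s.%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}
```

For example, `image_ref foo-bar-csv-ingest base-rt claude-code sonnet-4 v0.1.0` prints the image used in Quick Start, so the result can be passed straight to `docker run`. Composing the tag is safer than splitting an existing tag on dots, because the version field itself contains dots.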
v0.1 Scenarios
Thirty-six scenarios are available on the Foo Bar synthetic domain (13 tier-1, 18 tier-2, 5 tier-3). See Foo Bar Domain for the full list.
Example image tags:
foo-bar-csv-ingest.base-rt.claude-code.sonnet-4.v0.1.0
foo-bar-csv-ingest.base-rt.codex.gpt-5-codex.v0.1.0
foo-bar-csv-ingest.base-rt.cursor.auto.v0.1.0
foo-bar-time-grain-rollups.classic-de.claude-code.sonnet-4.v0.1.0
foo-bar-cross-system-reconciliation.olap-for-swe.claude-code.sonnet-4.v0.1.0
API Keys
Each image validates the key it needs at startup. Pass all your keys -- the container uses what it needs and ignores the rest.
| Provider | Variable | Agents |
|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | Claude Code |
| OpenAI | OPENAI_API_KEY | Codex |
| OpenAI (Codex override) | CODEX_API_KEY | Codex |
| Cursor | CURSOR_API_KEY | Cursor |
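Because each image validates its key at startup, it can help to check locally before launching a container. The sketch below maps the agent IDs used in the image tags to the variables in the table. `key_var_for_agent` and `check_key` are hypothetical helper names, and preferring `CODEX_API_KEY` whenever it is set is an assumption about what "override" means here, not documented behavior.

```shell
# Hypothetical helper: map an agent ID to the variable named in the table above.
key_var_for_agent() {
  case "$1" in
    claude-code) echo ANTHROPIC_API_KEY ;;
    codex)
      # Assumption: the Codex override wins whenever it is set and non-empty.
      if [ -n "${CODEX_API_KEY:-}" ]; then
        echo CODEX_API_KEY
      else
        echo OPENAI_API_KEY
      fi ;;
    cursor) echo CURSOR_API_KEY ;;
    *) return 1 ;;
  esac
}

# Returns 0 if the key for the given agent is set and non-empty,
# 1 if it is missing, and 2 if the agent ID is not in the table.
check_key() {
  key_var=$(key_var_for_agent "$1") || { echo "unknown agent: $1" >&2; return 2; }
  # key_var comes from the fixed list above, so eval is safe here.
  eval "key_value=\${$key_var:-}"
  if [ -z "$key_value" ]; then
    echo "missing $key_var for agent $1" >&2
    return 1
  fi
}
```

A call such as `check_key claude-code && docker run ...` then skips the container launch when the key is absent. Custom agents are not covered, since they declare their own key requirements.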
Using the CLI
The dec-bench CLI wraps Docker image building and execution:
dec-bench run \
  --scenario foo-bar-csv-ingest \
  --harness base-rt \
  --agent claude-code \
  --model claude-sonnet-4-20250514
Other supported agent IDs:
codex (for OpenAI Codex CLI)
cursor (for Cursor Agent CLI)
List available scenarios:
dec-bench list --domain foo-bar
View results from completed runs:
dec-bench results --scenario foo-bar-csv-ingest
Build Matrix
Export your keys once, loop over every image:
export ANTHROPIC_API_KEY=sk-ant-...
for scenario in foo-bar-csv-ingest foo-bar-slow-queries foo-bar-table-layout; do
  for harness in base-rt classic-de olap-for-swe; do
    docker run \
      -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
      "ghcr.io/514-labs/dec-bench:${scenario}.${harness}.claude-code.sonnet-4.v0.1.0" \
      >> results.jsonl
  done
done
Run a Haiku 4.5 variant:
dec-bench run \
  --scenario foo-bar-csv-ingest \
  --harness base-rt \
  --agent claude-code \
  --model haiku-4.5
Custom Agents
Mount your agent and point to it:
docker run \
  -e GOOGLE_API_KEY=$GOOGLE_API_KEY \
  -e CUSTOM_AGENT_CMD="/workspace/my-agent" \
  -v ./my-agent:/workspace/my-agent \
  ghcr.io/514-labs/dec-bench:foo-bar-csv-ingest.base-rt.custom.v0.1.0
Custom agents declare their own key requirements. If running in restricted network mode, provide an iptables.sh to allowlist your provider's API endpoint.