Documentation

Supported Agents

Agents, models, harnesses, and how to extend DEC Bench.

DEC Bench supports three agent CLIs today. Each agent connects to its provider's API and reports token usage back to the harness after the run completes.

Agents

| Agent | Provider | CLI | Cost Source | Token Usage |
| --- | --- | --- | --- | --- |
| Claude Code | Anthropic | claude | Agent-reported total_cost_usd | input_tokens, output_tokens, cache_creation_tokens, cache_read_tokens |
| Codex | OpenAI | codex exec --json | Derived from published pricing | input_tokens, cached_input_tokens, output_tokens via turn.completed |
| Cursor | Cursor | agent --print | Derived from published pricing | Server pass-through; not documented in CLI output schema |

Claude Code

Claude Code is the only agent whose CLI emits an explicit dollar cost (total_cost_usd). We store it directly and mark llmApiCostSource = "agent-reported". Token buckets are also available and stored for audit detail.

Codex

Codex emits token usage on each turn.completed JSONL event but does not emit dollar cost. The Usage struct in the open-source CLI contains three fields: input_tokens, cached_input_tokens, and output_tokens. The runtime internally tracks richer data (reasoning_output_tokens, total_tokens) but the JSONL writer drops them before serialization. Cost is derived from published per-token pricing.
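The per-turn accounting described above can be sketched as follows. This is a minimal illustration, not DEC Bench's actual parser: the turn.completed event type and the three Usage field names come from the text above, but the exact nesting of the usage payload and the rates used here are assumptions.

```javascript
// Illustrative rates (per million tokens); real rates come from MODEL_PRICING.
const RATES = { inputPerMillion: 2.5, cachedInputPerMillion: 0.25, outputPerMillion: 15.0 };

// Sum token buckets across every turn.completed event in a JSONL stream,
// then convert to dollars using per-million rates. Assumes the usage payload
// is nested under a `usage` key and the buckets are disjoint.
function costFromJsonl(jsonl, rates) {
  let input = 0, cached = 0, output = 0;
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const event = JSON.parse(line);
    if (event.type !== "turn.completed") continue;
    const u = event.usage ?? {};
    input += u.input_tokens ?? 0;
    cached += u.cached_input_tokens ?? 0;
    output += u.output_tokens ?? 0;
  }
  return (
    (input * rates.inputPerMillion +
      cached * rates.cachedInputPerMillion +
      output * rates.outputPerMillion) / 1_000_000
  );
}
```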

Cursor

Cursor's CLI does not document token or cost fields in its stream-json output format. The CLI bundle contains internal billing protobuf types (GetAggregatedUsageEventsResponse, requestCost, totalCents) but these are used for account-level dashboards, not per-run CLI output. Token usage comes from server pass-through in some execution paths. Cost is derived from published per-token pricing.

API Keys

Export the correct key before running. The harness passes it through to the agent container.

| Variable | Agent |
| --- | --- |
| ANTHROPIC_API_KEY | Claude Code |
| OPENAI_API_KEY | Codex |
| CODEX_API_KEY | Codex (override) |
| CURSOR_API_KEY | Cursor |

Models

Cost is derived from token usage and published per-token rates. The table below lists the models we have pricing data for. Models not listed will still run, but their cost shows as zero until pricing is added.

| Model | Provider | Input | Cached Input | Output | Notes |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4-20250514 | Anthropic | | | | Cost reported by CLI |
| gpt-5.4 | OpenAI | $2.50 | $0.25 | $15.00 | 2x input / 1.5x output above 272k context |
| gpt-5.1-codex-mini | OpenAI | $0.25 | $0.025 | $2.00 | |
| gpt-5-mini | OpenAI | $0.25 | $0.025 | $2.00 | Alias for codex-mini pricing |
| composer-1.5 | Cursor | $3.50 | $0.35 | $17.50 | |

All prices are per million tokens. Cache-creation and cache-read tokens use the same rate as their input-class counterpart unless the provider publishes a separate rate.
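The fallback rule above can be sketched as a small pricing function. This is illustrative, not the code in scripts/llm-pricing.mjs: the token-bucket and rate field names mirror this page, while cacheCreationPerMillion and cacheReadPerMillion are hypothetical optional overrides for providers that publish separate cache rates.

```javascript
// Derive dollar cost from stored token buckets and per-million rates.
// Cache-creation falls back to the input rate, cache-read to the
// cached-input rate, matching the rule stated above. Missing buckets
// count as zero.
function deriveCostUsd(tokens, pricing) {
  const perM = (count, rate) => ((count ?? 0) * rate) / 1_000_000;
  const cachedRate = pricing.cachedInputPerMillion ?? pricing.inputPerMillion;
  return (
    perM(tokens.inputTokens, pricing.inputPerMillion) +
    perM(tokens.cachedInputTokens, cachedRate) +
    perM(tokens.cacheCreationTokens, pricing.cacheCreationPerMillion ?? pricing.inputPerMillion) +
    perM(tokens.cacheReadTokens, pricing.cacheReadPerMillion ?? cachedRate) +
    perM(tokens.outputTokens, pricing.outputPerMillion)
  );
}
```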

Default Models

Each agent has a default model used when no --model flag is passed:

| Agent | Default Model |
| --- | --- |
| Claude Code | claude-sonnet-4-20250514 |
| Codex | gpt-5-codex |
| Cursor | composer-1.5 |

Long-Context Pricing

gpt-5.4 applies multipliers when input exceeds 272,000 tokens:

| Bucket | Standard | Long Context |
| --- | --- | --- |
| Input | $2.50/M | $5.00/M |
| Cached Input | $0.25/M | $0.50/M |
| Output | $15.00/M | $22.50/M |
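A sketch of how the multipliers might be encoded. longContextThreshold is the field named in the Adding a Model steps below; the two multiplier field names are hypothetical, and this sketch assumes the multiplied rates apply to the whole run once input crosses the threshold (the billing granularity is not specified here).

```javascript
// gpt-5.4 entry with a long-context tier: input-class rates double and the
// output rate is multiplied by 1.5 past 272,000 input tokens.
const GPT_5_4 = {
  inputPerMillion: 2.5,
  cachedInputPerMillion: 0.25,
  outputPerMillion: 15.0,
  longContextThreshold: 272_000,
  longContextInputMultiplier: 2,
  longContextOutputMultiplier: 1.5,
};

// Pick the effective per-million rates for a run based on its input size.
function effectiveRates(model, inputTokens) {
  const long = model.longContextThreshold != null && inputTokens > model.longContextThreshold;
  return {
    inputPerMillion: model.inputPerMillion * (long ? model.longContextInputMultiplier : 1),
    cachedInputPerMillion: model.cachedInputPerMillion * (long ? model.longContextInputMultiplier : 1),
    outputPerMillion: model.outputPerMillion * (long ? model.longContextOutputMultiplier : 1),
  };
}
```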

Adding a Model

To add pricing for a new model:

  1. Add the model name and any aliases to MODEL_ALIASES in scripts/llm-pricing.mjs.
  2. Add per-million-token rates to MODEL_PRICING in the same file.
  3. Run pnpm llm-metrics:test to verify the pricing logic.
  4. If the model has special pricing tiers (like long-context multipliers), add longContextThreshold and the corresponding multiplier fields.
Example pricing entry:
{
  "my-new-model": {
    "inputPerMillion": 3.00,
    "cachedInputPerMillion": 0.30,
    "outputPerMillion": 15.00
  }
}

How Cost Is Calculated

Each benchmark run stores an llmApiCostSource field that tells you exactly where the dollar figure came from.

| Source | Meaning | Agents |
| --- | --- | --- |
| agent-reported | Dollar cost emitted directly by the CLI | Claude Code |
| derived-from-published-pricing | Cost computed from stored token buckets and published per-token rates | Codex, Cursor |

Token Buckets

Every run stores granular token counts when the agent reports them:

| Field | Description |
| --- | --- |
| inputTokens | Prompt and context tokens sent to the model |
| outputTokens | Tokens generated by the model |
| cachedInputTokens | Input tokens billed at the cached-input rate |
| cacheCreationTokens | Tokens used to create a new cache entry |
| cacheReadTokens | Cache-hit tokens (provider-specific) |
| cacheWriteTokens | Cache-write tokens (provider-specific) |
| tokensUsed | Sum of all buckets above |
tokensUsed is the headline number shown on the leaderboard. The individual buckets are visible in the audit detail view.
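The headline number can be sketched as a simple sum over whatever buckets a run reports, with missing buckets treated as zero. This mirrors the table above rather than the harness's actual code.

```javascript
// Sum all granular token buckets into the tokensUsed headline figure.
function tokensUsed(run) {
  const buckets = [
    "inputTokens", "outputTokens", "cachedInputTokens",
    "cacheCreationTokens", "cacheReadTokens", "cacheWriteTokens",
  ];
  return buckets.reduce((sum, key) => sum + (run[key] ?? 0), 0);
}
```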

Harnesses

A harness defines the tooling environment the agent works in. Every scenario specifies which harness it runs against. All harnesses share the same base infrastructure (ClickHouse, Redpanda, Postgres) and differ only in the additional tools installed.

| Harness | Tools | Network | Description |
| --- | --- | --- | --- |
| Base RT | None (base services only) | Open | Control group. Python, Node.js, and CLI access to the three foundational services. |
| Classic DE | Airflow 2.10, PySpark 3.5, dbt-core 1.10 | Open | Standard data engineering toolkit. The agent wires together independent best-in-class tools. |
| OLAP for SWE | MooseStack (moose-cli 0.6) | Open | Software-engineering-first approach. Typed schemas, automated migrations, unified framework. |

CLI Flags

Pass --harness to dec-bench build and dec-bench run to select a harness:

dec-bench build --scenario foo-bar-csv-ingest --harness classic-de --agent claude-code
dec-bench run   --scenario foo-bar-csv-ingest --harness classic-de

The default harness is base-rt.

Extending DEC Bench

Adding a Custom Agent

Agent runners live in docker/agents/<agent-id>/run.sh. To add a new agent:

  1. Create docker/agents/<your-agent>/run.sh with the agent invocation logic.
  2. The script receives the scenario prompt on stdin and environment variables for PERSONA, PLAN_MODE, EVAL_SCENARIO, EVAL_HARNESS, EVAL_AGENT, EVAL_VERSION, and MODEL.
  3. Output structured JSON results to stdout using the DEC Bench marker protocol (see existing agents for the format).
  4. Pass the agent ID to the CLI: dec-bench build --agent <your-agent> && dec-bench run --agent <your-agent>.
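A hedged skeleton of what a runner might delegate to, shown in Node for illustration (the real entry point is the shell script run.sh). The environment variable names come from step 2 above; the helper names and the fields in the result object are hypothetical, and the marker-protocol framing is deliberately left as a comment because its exact format must be copied from an existing agent.

```javascript
// Hypothetical helper: assemble run-level metadata from the documented
// environment variables. The returned field names are assumptions, not the
// marker protocol's real schema.
function buildResult(env) {
  return {
    scenario: env.EVAL_SCENARIO,
    harness: env.EVAL_HARNESS,
    agent: env.EVAL_AGENT,
    model: env.MODEL,
  };
}

// Read the scenario prompt from stdin, then emit structured JSON on stdout.
// run.sh would invoke this; it is a sketch, not a working agent.
function runAgent() {
  let prompt = "";
  process.stdin.setEncoding("utf8");
  process.stdin.on("data", (chunk) => (prompt += chunk));
  process.stdin.on("end", () => {
    // Invoke your agent CLI here with `prompt` and process.env.MODEL, then
    // wrap the result using the DEC Bench marker protocol (copy the exact
    // framing from an existing agent under docker/agents/).
    process.stdout.write(JSON.stringify(buildResult(process.env)) + "\n");
  });
}
```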

Adding a Custom Harness

Harness definitions live in apps/web/data/harnesses/<harness-id>.json. To add a new harness:

  1. Create a JSON file with id, title, description, installScript, networkPolicy, and tools.
  2. The installScript runs inside the container during dec-bench build. It should install any tooling the agent needs.
  3. Register it with the CLI: dec-bench registry add --type harness --id <your-harness>.
  4. Update scenario metadata to reference the new harness if needed.
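A minimal example of what such a file might look like. The field names come from step 1; every value here, including the networkPolicy spelling, the tool name, and the install command, is an illustrative assumption, so check an existing harness file under apps/web/data/harnesses/ for the real conventions.

```json
{
  "id": "my-harness",
  "title": "My Harness",
  "description": "Base services plus the extra tooling my scenarios need.",
  "installScript": "pip install duckdb",
  "networkPolicy": "open",
  "tools": ["duckdb"]
}
```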

Adding a Custom Scenario

See the authoring scenarios guide for the full walkthrough. The short version:

dec-bench create --scenario <your-scenario>
# Edit the generated files in scenarios/<your-scenario>/
dec-bench validate --scenario <your-scenario>
dec-bench build --scenario <your-scenario>
dec-bench run --scenario <your-scenario>