Supported Agents
Agents, models, harnesses, and how to extend DEC Bench.
DEC Bench supports three agent CLIs today. Each agent connects to its provider's API and reports token usage back to the harness after the run completes.
Agents
| Agent | Provider | CLI | Cost Source | Token Usage |
|---|---|---|---|---|
| Claude Code | Anthropic | claude | Agent-reported total_cost_usd | input_tokens, output_tokens, cache_creation_tokens, cache_read_tokens |
| Codex | OpenAI | codex exec --json | Derived from published pricing | input_tokens, cached_input_tokens, output_tokens via turn.completed |
| Cursor | Cursor | agent --print | Derived from published pricing | Server pass-through; not documented in CLI output schema |
Claude Code
Claude Code is the only agent whose CLI emits an explicit dollar cost (total_cost_usd). We store it directly and mark llmApiCostSource = "agent-reported". Token buckets are also available and stored for audit detail.
Codex
Codex emits token usage on each turn.completed JSONL event but does not emit dollar cost. The Usage struct in the open-source CLI contains three fields: input_tokens, cached_input_tokens, and output_tokens. The runtime internally tracks richer data (reasoning_output_tokens, total_tokens) but the JSONL writer drops them before serialization. Cost is derived from published per-token pricing.
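Accumulating the token buckets from the JSONL stream can be sketched as follows. The `turn.completed` event type and the three `Usage` fields are as described above; the exact nesting of the usage payload inside the event is an assumption for illustration, not the actual DEC Bench parser.

```javascript
// Sketch: sum Codex token usage across turn.completed JSONL events.
// The nesting of counts under `usage` is an illustrative assumption.
function accumulateUsage(jsonlText) {
  const totals = { input_tokens: 0, cached_input_tokens: 0, output_tokens: 0 };
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    const event = JSON.parse(line);
    if (event.type !== "turn.completed") continue;
    for (const key of Object.keys(totals)) {
      totals[key] += event.usage?.[key] ?? 0;
    }
  }
  return totals;
}

// Example: two completed turns.
const totals = accumulateUsage([
  '{"type":"turn.completed","usage":{"input_tokens":1200,"cached_input_tokens":400,"output_tokens":300}}',
  '{"type":"turn.completed","usage":{"input_tokens":800,"cached_input_tokens":0,"output_tokens":150}}',
].join("\n"));
console.log(totals); // { input_tokens: 2000, cached_input_tokens: 400, output_tokens: 450 }
```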
Cursor
Cursor's CLI does not document token or cost fields in its stream-json output format. The CLI bundle contains internal billing protobuf types (GetAggregatedUsageEventsResponse, requestCost, totalCents) but these are used for account-level dashboards, not per-run CLI output. Token usage comes from server pass-through in some execution paths. Cost is derived from published per-token pricing.
API Keys
Export the correct key before running. The harness passes it through to the agent container.
| Variable | Agent |
|---|---|
| ANTHROPIC_API_KEY | Claude Code |
| OPENAI_API_KEY | Codex |
| CODEX_API_KEY | Codex (override) |
| CURSOR_API_KEY | Cursor |
Models
Cost is derived from token usage and published per-token rates. The table below lists the models we have pricing data for. Models not listed here will still run but their cost will show as zero until pricing is added.
| Model | Provider | Input | Cached Input | Output | Notes |
|---|---|---|---|---|---|
| claude-sonnet-4-20250514 | Anthropic | — | — | — | Cost reported by CLI |
| gpt-5.4 | OpenAI | $2.50 | $0.25 | $15.00 | 2x input / 1.5x output above 272k context |
| gpt-5.1-codex-mini | OpenAI | $0.25 | $0.025 | $2.00 | |
| gpt-5-mini | OpenAI | $0.25 | $0.025 | $2.00 | Alias for codex-mini pricing |
| composer-1.5 | Cursor | $3.50 | $0.35 | $17.50 | |
All prices are per million tokens. Cache-creation and cache-read tokens use the same rate as their input-class counterpart unless the provider publishes a separate rate.
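A derived cost computation consistent with these rules might look like the sketch below. The per-million field names mirror the MODEL_PRICING example later in this page; the function name and surrounding shape are illustrative, not the actual code in scripts/llm-pricing.mjs.

```javascript
// Sketch: derive dollar cost from token buckets and per-million rates.
// Rates for gpt-5.1-codex-mini are taken from the table above.
const MODEL_PRICING = {
  "gpt-5.1-codex-mini": {
    inputPerMillion: 0.25,
    cachedInputPerMillion: 0.025,
    outputPerMillion: 2.0,
  },
};

function deriveCostUsd(model, usage) {
  const p = MODEL_PRICING[model];
  if (!p) return 0; // unpriced models run, but cost shows as zero
  return (
    (usage.inputTokens * p.inputPerMillion +
      usage.cachedInputTokens * p.cachedInputPerMillion +
      usage.outputTokens * p.outputPerMillion) /
    1_000_000
  );
}

const cost = deriveCostUsd("gpt-5.1-codex-mini", {
  inputTokens: 2_000_000,
  cachedInputTokens: 1_000_000,
  outputTokens: 500_000,
});
console.log(cost); // ≈ 1.525 (0.50 + 0.025 + 1.00)
```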
Default Models
Each agent has a default model used when no --model flag is passed:
| Agent | Default Model |
|---|---|
| Claude Code | claude-sonnet-4-20250514 |
| Codex | gpt-5-codex |
| Cursor | composer-1.5 |
Long-Context Pricing
gpt-5.4 applies multipliers when input exceeds 272,000 tokens:
| Bucket | Standard | Long Context |
|---|---|---|
| Input | $2.50/M | $5.00/M |
| Cached Input | $0.25/M | $0.50/M |
| Output | $15.00/M | $22.50/M |
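The multiplier logic can be sketched as below. Whether the long-context rate applies to the entire request or only to tokens past the threshold is not specified here; this sketch assumes the whole request is rebilled at the long-context rate once input exceeds 272,000 tokens.

```javascript
// Sketch: gpt-5.4 long-context pricing, per the rate table above.
// Assumption: the whole request switches to long-context rates
// once inputTokens exceeds the threshold.
const STANDARD = { input: 2.5, cachedInput: 0.25, output: 15.0 };
const LONG_CONTEXT = { input: 5.0, cachedInput: 0.5, output: 22.5 };
const THRESHOLD = 272_000;

function gpt54CostUsd(usage) {
  const rates = usage.inputTokens > THRESHOLD ? LONG_CONTEXT : STANDARD;
  return (
    (usage.inputTokens * rates.input +
      usage.cachedInputTokens * rates.cachedInput +
      usage.outputTokens * rates.output) /
    1_000_000
  );
}

// 300k input tokens crosses the threshold, so long-context rates apply.
const longRunCost = gpt54CostUsd({
  inputTokens: 300_000,
  cachedInputTokens: 0,
  outputTokens: 10_000,
});
console.log(longRunCost); // ≈ 1.725 (1.50 input + 0.225 output)
```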
Adding a Model
To add pricing for a new model:
- Add the model name and any aliases to `MODEL_ALIASES` in `scripts/llm-pricing.mjs`.
- Add per-million-token rates to `MODEL_PRICING` in the same file.
- Run `pnpm llm-metrics:test` to verify the pricing logic.
- If the model has special pricing tiers (like long-context multipliers), add `longContextThreshold` and the corresponding multiplier fields.
{
"my-new-model": {
"inputPerMillion": 3.00,
"cachedInputPerMillion": 0.30,
"outputPerMillion": 15.00
}
}
How Cost Is Calculated
Each benchmark run stores an llmApiCostSource field that tells you exactly where the dollar figure came from.
| Source | Meaning | Agents |
|---|---|---|
| agent-reported | Dollar cost emitted directly by the CLI | Claude Code |
| derived-from-published-pricing | Cost computed from stored token buckets and published per-token rates | Codex, Cursor |
Token Buckets
Every run stores granular token counts when the agent reports them:
| Field | Description |
|---|---|
| inputTokens | Prompt and context tokens sent to the model |
| outputTokens | Tokens generated by the model |
| cachedInputTokens | Input tokens billed at the cached-input rate |
| cacheCreationTokens | Tokens used to create a new cache entry |
| cacheReadTokens | Cache-hit tokens (provider-specific) |
| cacheWriteTokens | Cache-write tokens (provider-specific) |
| tokensUsed | Sum of all buckets above |
tokensUsed is the headline number shown on the leaderboard. The individual buckets are visible in the audit detail view.
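The headline number is reproducible from the buckets. A minimal sketch, using the field names from the table above and treating unreported buckets as zero (not every agent reports every field):

```javascript
// Sketch: tokensUsed as the sum of all reported token buckets.
// Buckets an agent does not report count as zero.
const BUCKETS = [
  "inputTokens",
  "outputTokens",
  "cachedInputTokens",
  "cacheCreationTokens",
  "cacheReadTokens",
  "cacheWriteTokens",
];

function tokensUsed(run) {
  return BUCKETS.reduce((sum, field) => sum + (run[field] ?? 0), 0);
}

// Example: a run that reports only three of the six buckets.
const total = tokensUsed({ inputTokens: 1200, outputTokens: 300, cacheReadTokens: 500 });
console.log(total); // 2000
```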
Harnesses
A harness defines the tooling environment the agent works in. Every scenario specifies which harness it runs against. All harnesses share the same base infrastructure (ClickHouse, Redpanda, Postgres) and differ only in the additional tools installed.
| Harness | Tools | Network | Description |
|---|---|---|---|
| Base RT | None (base services only) | Open | Control group. Python, Node.js, and CLI access to the three foundational services. |
| Classic DE | Airflow 2.10, PySpark 3.5, dbt-core 1.10 | Open | Standard data engineering toolkit. The agent wires together independent best-in-class tools. |
| OLAP for SWE | MooseStack (moose-cli 0.6) | Open | Software-engineering-first approach. Typed schemas, automated migrations, unified framework. |
CLI Flags
Pass --harness to dec-bench build and dec-bench run to select a harness:
dec-bench build --scenario foo-bar-csv-ingest --harness classic-de --agent claude-code
dec-bench run --scenario foo-bar-csv-ingest --harness classic-de
The default harness is base-rt.
Extending DEC Bench
Adding a Custom Agent
Agent runners live in docker/agents/<agent-id>/run.sh. To add a new agent:
- Create `docker/agents/<your-agent>/run.sh` with the agent invocation logic.
- The script receives the scenario prompt on stdin and environment variables for `PERSONA`, `PLAN_MODE`, `EVAL_SCENARIO`, `EVAL_HARNESS`, `EVAL_AGENT`, `EVAL_VERSION`, and `MODEL`.
- Output structured JSON results to stdout using the DEC Bench marker protocol (see existing agents for the format).
- Pass the agent ID to the CLI: `dec-bench build --agent <your-agent> && dec-bench run --agent <your-agent>`.
Adding a Custom Harness
Harness definitions live in apps/web/data/harnesses/<harness-id>.json. To add a new harness:
- Create a JSON file with `id`, `title`, `description`, `installScript`, `networkPolicy`, and `tools`.
- The `installScript` runs inside the container during `dec-bench build`. It should install any tooling the agent needs.
- Register it with the CLI: `dec-bench registry add --type harness --id <your-harness>`.
- Update scenario metadata to reference the new harness if needed.
Adding a Custom Scenario
See the authoring scenarios guide for the full walkthrough. The short version:
dec-bench create --scenario <your-scenario>
# Edit the generated files in scenarios/<your-scenario>/
dec-bench validate --scenario <your-scenario>
dec-bench build --scenario <your-scenario>
dec-bench run --scenario <your-scenario>