Documentation

Supported Agents

Agents, models, harnesses, and how to extend DEC Bench.

DEC Bench supports three agent CLIs today. Each agent connects to its provider's API and reports token usage back to the harness after the run completes.

Agents

| Agent | Provider | CLI | Cost Source | Token Usage |
| --- | --- | --- | --- | --- |
| Claude Code | Anthropic | claude | Agent-reported total_cost_usd | input_tokens, output_tokens, cache_creation_tokens, cache_read_tokens |
| Codex | OpenAI | codex exec --json | Derived from published pricing | input_tokens, cached_input_tokens, output_tokens via turn.completed |
| Cursor | Cursor | agent --print | Derived from published pricing | Server pass-through; not documented in CLI output schema |

Claude Code

Claude Code is the only agent whose CLI emits an explicit dollar cost (total_cost_usd). We store it directly and mark llmApiCostSource = "agent-reported". Token buckets are also available and stored for audit detail.

Codex

Codex emits token usage on each turn.completed JSONL event but does not emit dollar cost. The Usage struct in the open-source CLI contains three fields: input_tokens, cached_input_tokens, and output_tokens. The runtime internally tracks richer data (reasoning_output_tokens, total_tokens) but the JSONL writer drops them before serialization. Cost is derived from published per-token pricing.
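The per-turn accounting described above can be sketched as follows. This is a minimal illustration, not DEC Bench's actual parser: the turn.completed event type and the three Usage field names come from the text above, but the exact nesting of the usage payload and the rates used here are assumptions.

```javascript
// Illustrative rates (per million tokens); real rates come from MODEL_PRICING.
const RATES = { inputPerMillion: 2.5, cachedInputPerMillion: 0.25, outputPerMillion: 15.0 };

// Sum token buckets across every turn.completed event in a JSONL stream,
// then convert to dollars using per-million rates. Assumes the usage payload
// is nested under a `usage` key and the buckets are disjoint.
function costFromJsonl(jsonl, rates) {
  let input = 0, cached = 0, output = 0;
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const event = JSON.parse(line);
    if (event.type !== "turn.completed") continue;
    const u = event.usage ?? {};
    input += u.input_tokens ?? 0;
    cached += u.cached_input_tokens ?? 0;
    output += u.output_tokens ?? 0;
  }
  return (
    (input * rates.inputPerMillion +
      cached * rates.cachedInputPerMillion +
      output * rates.outputPerMillion) / 1_000_000
  );
}
```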

Cursor

Cursor's CLI does not document token or cost fields in its stream-json output format. The CLI bundle contains internal billing protobuf types (GetAggregatedUsageEventsResponse, requestCost, totalCents) but these are used for account-level dashboards, not per-run CLI output. Token usage comes from server pass-through in some execution paths. Cost is derived from published per-token pricing.

API Keys

Export the correct key before running. The harness passes it through to the agent container.

| Variable | Agent |
| --- | --- |
| ANTHROPIC_API_KEY | Claude Code |
| OPENAI_API_KEY | Codex |
| CODEX_API_KEY | Codex (override) |
| CURSOR_API_KEY | Cursor |

Models

Cost is derived from token usage and published per-token rates. The table below lists the models we have pricing data for. Models not listed will still run, but their cost shows as zero until pricing is added.

| Model | Provider | Input | Cached Input | Output | Notes |
| --- | --- | --- | --- | --- | --- |
| claude-sonnet-4-20250514 | Anthropic | | | | Cost reported by CLI |
| gpt-5.4 | OpenAI | $2.50 | $0.25 | $15.00 | 2x input / 1.5x output above 272k context |
| gpt-5.1-codex-mini | OpenAI | $0.25 | $0.025 | $2.00 | |
| gpt-5-mini | OpenAI | $0.25 | $0.025 | $2.00 | Alias for codex-mini pricing |
| composer-1.5 | Cursor | $3.50 | $0.35 | $17.50 | |

All prices are per million tokens. Cache-creation and cache-read tokens use the same rate as their input-class counterpart unless the provider publishes a separate rate.
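The fallback rule above can be sketched as a small pricing function. This is illustrative, not the code in scripts/llm-pricing.mjs: the token-bucket and rate field names mirror this page, while cacheCreationPerMillion and cacheReadPerMillion are hypothetical optional overrides for providers that publish separate cache rates.

```javascript
// Derive dollar cost from stored token buckets and per-million rates.
// Cache-creation falls back to the input rate, cache-read to the
// cached-input rate, matching the rule stated above. Missing buckets
// count as zero.
function deriveCostUsd(tokens, pricing) {
  const perM = (count, rate) => ((count ?? 0) * rate) / 1_000_000;
  const cachedRate = pricing.cachedInputPerMillion ?? pricing.inputPerMillion;
  return (
    perM(tokens.inputTokens, pricing.inputPerMillion) +
    perM(tokens.cachedInputTokens, cachedRate) +
    perM(tokens.cacheCreationTokens, pricing.cacheCreationPerMillion ?? pricing.inputPerMillion) +
    perM(tokens.cacheReadTokens, pricing.cacheReadPerMillion ?? cachedRate) +
    perM(tokens.outputTokens, pricing.outputPerMillion)
  );
}
```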

Default Models

Each agent has a default model used when no --model flag is passed:

| Agent | Default Model |
| --- | --- |
| Claude Code | claude-sonnet-4-20250514 |
| Codex | gpt-5-codex |
| Cursor | composer-1.5 |

Long-Context Pricing

gpt-5.4 applies multipliers when input exceeds 272,000 tokens:

| Bucket | Standard | Long Context |
| --- | --- | --- |
| Input | $2.50/M | $5.00/M |
| Cached Input | $0.25/M | $0.50/M |
| Output | $15.00/M | $22.50/M |
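A sketch of how the multipliers might be encoded. longContextThreshold is the field named in the Adding a Model steps below; the two multiplier field names are hypothetical, and this sketch assumes the multiplied rates apply to the whole run once input crosses the threshold (the billing granularity is not specified here).

```javascript
// gpt-5.4 entry with a long-context tier: input-class rates double and the
// output rate is multiplied by 1.5 past 272,000 input tokens.
const GPT_5_4 = {
  inputPerMillion: 2.5,
  cachedInputPerMillion: 0.25,
  outputPerMillion: 15.0,
  longContextThreshold: 272_000,
  longContextInputMultiplier: 2,
  longContextOutputMultiplier: 1.5,
};

// Pick the effective per-million rates for a run based on its input size.
function effectiveRates(model, inputTokens) {
  const long = model.longContextThreshold != null && inputTokens > model.longContextThreshold;
  return {
    inputPerMillion: model.inputPerMillion * (long ? model.longContextInputMultiplier : 1),
    cachedInputPerMillion: model.cachedInputPerMillion * (long ? model.longContextInputMultiplier : 1),
    outputPerMillion: model.outputPerMillion * (long ? model.longContextOutputMultiplier : 1),
  };
}
```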

Adding a Model

To add pricing for a new model:

  1. Add the model name and any aliases to MODEL_ALIASES in scripts/llm-pricing.mjs.
  2. Add per-million-token rates to MODEL_PRICING in the same file.
  3. Run pnpm llm-metrics:test to verify the pricing logic.
  4. If the model has special pricing tiers (like long-context multipliers), add longContextThreshold and the corresponding multiplier fields.
Example pricing entry:
{
  "my-new-model": {
    "inputPerMillion": 3.00,
    "cachedInputPerMillion": 0.30,
    "outputPerMillion": 15.00
  }
}

How Cost Is Calculated

Each benchmark run stores an llmApiCostSource field that tells you exactly where the dollar figure came from.

| Source | Meaning | Agents |
| --- | --- | --- |
| agent-reported | Dollar cost emitted directly by the CLI | Claude Code |
| derived-from-published-pricing | Cost computed from stored token buckets and published per-token rates | Codex, Cursor |

Token Buckets

Every run stores granular token counts when the agent reports them:

| Field | Description |
| --- | --- |
| inputTokens | Prompt and context tokens sent to the model |
| outputTokens | Tokens generated by the model |
| cachedInputTokens | Input tokens billed at the cached-input rate |
| cacheCreationTokens | Tokens used to create a new cache entry |
| cacheReadTokens | Cache-hit tokens (provider-specific) |
| cacheWriteTokens | Cache-write tokens (provider-specific) |
| tokensUsed | Sum of all buckets above |
tokensUsed is the headline number shown on the leaderboard. The individual buckets are visible in the audit detail view.
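The headline number can be sketched as a simple sum over whatever buckets a run reports, with missing buckets treated as zero. This mirrors the table above rather than the harness's actual code.

```javascript
// Sum all granular token buckets into the tokensUsed headline figure.
function tokensUsed(run) {
  const buckets = [
    "inputTokens", "outputTokens", "cachedInputTokens",
    "cacheCreationTokens", "cacheReadTokens", "cacheWriteTokens",
  ];
  return buckets.reduce((sum, key) => sum + (run[key] ?? 0), 0);
}
```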

Harnesses

A harness defines the tooling environment the agent works in. Every scenario specifies which harness it runs against. All harnesses share the same base infrastructure (ClickHouse, Redpanda, Postgres) and differ only in the additional tools installed.

| Harness | Tools | Network | Description |
| --- | --- | --- | --- |
| Base RT | None (base services only) | Open | Control group. Python, Node.js, and CLI access to the three foundational services. |
| Classic DE | Airflow 2.10, PySpark 3.5, dbt-core 1.10 | Open | Standard data engineering toolkit. The agent wires together independent best-in-class tools. |
| OLAP for SWE | MooseStack (moose-cli 0.6) | Open | Software-engineering-first approach. Typed schemas, automated migrations, unified framework. |

CLI Flags

Pass --harness to dec-bench build and dec-bench run to select a harness:

dec-bench build --scenario foo-bar-csv-ingest --harness classic-de --agent claude-code
dec-bench run   --scenario foo-bar-csv-ingest --harness classic-de

The default harness is base-rt.

Extending DEC Bench

Adding a Custom Agent

Agent runners live in docker/agents/<agent-id>/run.sh. To add a new agent:

  1. Create docker/agents/<your-agent>/run.sh with the agent invocation logic.
  2. The script receives the scenario prompt on stdin and environment variables for PERSONA, PLAN_MODE, EVAL_SCENARIO, EVAL_HARNESS, EVAL_AGENT, EVAL_VERSION, and MODEL.
  3. Output structured JSON results to stdout using the DEC Bench marker protocol (see existing agents for the format).
  4. Pass the agent ID to the CLI: dec-bench build --agent <your-agent> && dec-bench run --agent <your-agent>.
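A hedged skeleton of what a runner might delegate to, shown in Node for illustration (the real entry point is the shell script run.sh). The environment variable names come from step 2 above; the helper names and the fields in the result object are hypothetical, and the marker-protocol framing is deliberately left as a comment because its exact format must be copied from an existing agent.

```javascript
// Hypothetical helper: assemble run-level metadata from the documented
// environment variables. The returned field names are assumptions, not the
// marker protocol's real schema.
function buildResult(env) {
  return {
    scenario: env.EVAL_SCENARIO,
    harness: env.EVAL_HARNESS,
    agent: env.EVAL_AGENT,
    model: env.MODEL,
  };
}

// Read the scenario prompt from stdin, then emit structured JSON on stdout.
// run.sh would invoke this; it is a sketch, not a working agent.
function runAgent() {
  let prompt = "";
  process.stdin.setEncoding("utf8");
  process.stdin.on("data", (chunk) => (prompt += chunk));
  process.stdin.on("end", () => {
    // Invoke your agent CLI here with `prompt` and process.env.MODEL, then
    // wrap the result using the DEC Bench marker protocol (copy the exact
    // framing from an existing agent under docker/agents/).
    process.stdout.write(JSON.stringify(buildResult(process.env)) + "\n");
  });
}
```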

Adding a Custom Harness

Harness definitions live in apps/web/data/harnesses/<harness-id>.json. To add a new harness:

  1. Create a JSON file with id, title, description, installScript, networkPolicy, and tools.
  2. The installScript runs inside the container during dec-bench build. It should install any tooling the agent needs.
  3. Register it with the CLI: dec-bench registry add --type harness --id <your-harness>.
  4. Update scenario metadata to reference the new harness if needed.
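A minimal example of what such a file might look like. The field names come from step 1; every value here, including the networkPolicy spelling, the tool name, and the install command, is an illustrative assumption, so check an existing harness file under apps/web/data/harnesses/ for the real conventions.

```json
{
  "id": "my-harness",
  "title": "My Harness",
  "description": "Base services plus the extra tooling my scenarios need.",
  "installScript": "pip install duckdb",
  "networkPolicy": "open",
  "tools": ["duckdb"]
}
```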

Adding a Custom Scenario

See the authoring scenarios guide for the full walkthrough. The short version:

dec-bench create --scenario <your-scenario>
# Edit the generated files in scenarios/<your-scenario>/
dec-bench validate --scenario <your-scenario>
dec-bench build --scenario <your-scenario>
dec-bench run --scenario <your-scenario>