§1 Five-Gate Evaluation Model
The evaluation model uses five sequential quality gates. Gates are strictly ordered: an agent must clear each gate before the next is evaluated. A solution that produces correct output but crashes on edge cases clears G2 (Correct) but fails at G3 (Robust). A solution that handles edge cases but misses latency targets clears G3 but fails at G4 (Performant).
This sequential structure is deliberate. Aggregate pass/fail scores hide where agents struggle. Gate attrition curves reveal the shape of quality: two agents with the same final score can have very different failure profiles.
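For orientation, the gate sequence can be written down as a small ordered list. This is an illustrative sketch, not the harness's actual definition; the G1 and G5 labels (functional, production) are taken from the results in §5.

```typescript
// The five quality gates in the strict order they are evaluated.
// Names are illustrative; the benchmark's own identifiers may differ.
const GATES = [
  "G1 Functional",  // runs end-to-end and produces output
  "G2 Correct",     // output matches the expected results
  "G3 Robust",      // survives edge cases without crashing
  "G4 Performant",  // meets latency and throughput targets
  "G5 Production",  // production-ready code (config via env vars, no debug artifacts)
] as const;

type Gate = (typeof GATES)[number];

// A run's gate progression: the gates cleared, in order, and where it stopped.
interface GateProgression {
  cleared: Gate[];   // a prefix of GATES
  failedAt?: Gate;   // first gate not cleared, if any
}
```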
§2 Evaluation Variables
Each benchmark run has three independent variables: the agent, the tooling harness, and the prompt variant. The full evaluation matrix captures every combination to isolate the effect of each. The dependent variables are quality and efficiency. Quality is captured as gate progression (which of the five gates the agent cleared) and normalized score. Efficiency is captured as token usage, LLM API cost, and wall-clock time. A sketch of a full run record follows the variable definitions below.
Agent:
The AI coding tool being evaluated. The current evaluation includes Claude Code (Opus 4.6, Sonnet 4.6), Codex (GPT-5.4), and Cursor (Composer 2).
Tooling Harness:
The tooling environment the agent works in. Three configurations are tested. Base infrastructure provides Postgres, ClickHouse, and Redpanda without additional tooling, measuring the agent’s first-principles reasoning. Classic DE adds dbt, Airflow, and Spark, representing traditional data engineering stacks. OLAP for SWE adds MooseStack with typed schemas and auto migrations, representing modern analytical tooling.
Prompt Variant:
How much domain knowledge the prompt provides. Each scenario has two conditions. The baseline prompt gives minimal context: no tool names and no implementation hints; the agent figures out the approach on its own. The informed prompt provides domain-specific guidance: it names tools, specifies targets, and sets technical constraints.
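Taken together, one run in the matrix can be described by a record like the one below. This is a sketch; the field names and string values are illustrative, not the harness's actual schema.

```typescript
// One cell of the evaluation matrix: agent × tooling harness × prompt variant,
// applied to a single scenario. Names are illustrative.
type Harness = "base" | "classic-de" | "olap-for-swe";
type PromptVariant = "baseline" | "informed";

interface RunConfig {
  agent: string;            // e.g. "claude-code-opus-4.6", "codex-gpt-5.4", "cursor-composer-2"
  harness: Harness;
  prompt: PromptVariant;
  scenario: string;         // scenario identifier
}

// Dependent variables recorded for the run.
interface RunOutcome {
  gatesCleared: number;     // 0–5
  normalizedScore: number;  // 0–1, see §4.4
  tokens: number;
  costUsd: number;          // LLM API spend
  wallClockMs: number;
}
```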
§3 Scenarios
A scenario is a self-contained data engineering task. Each scenario defines its own infrastructure, seed data, starting state, and deterministic assertions. Scenarios span the Foo Bar synthetic SaaS analytics domain, covering ingestion, transformation, query optimization, schema design, streaming pipelines, storage optimization, and cross-system reconciliation.
Each scenario is assigned a difficulty tier based on the scope of infrastructure, number of tasks, and depth of reasoning required.
Difficulty Tier Definitions
Example Scenarios
§4 Methodology
Each benchmark run is a controlled experiment: one agent attempts one scenario under fixed conditions, and every action it takes is recorded.
4.1 Scenario Definition
The agent starts inside an isolated Docker container with live infrastructure already running: databases with tables, streams with topics, seed data loaded. It receives a single natural-language prompt describing what to build or fix, plus a set of TypeScript assertion functions that define success.
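As an illustration of what those assertions look like, here is a minimal sketch of a scenario check. The `query` helper, table names, and column names are hypothetical; each scenario ships its own checks.

```typescript
// Hypothetical scenario check: the hourly revenue rollup must reconcile
// with the raw purchase events. `query` stands in for whatever database
// client the scenario provides.
type Query = (sql: string) => Promise<Array<Record<string, unknown>>>;

export async function assertRevenueRollupMatches(query: Query): Promise<boolean> {
  const [raw] = await query(
    "SELECT round(sum(amount), 2) AS total FROM events WHERE type = 'purchase'"
  );
  const [rolled] = await query(
    "SELECT round(sum(revenue), 2) AS total FROM revenue_hourly"
  );
  // Deterministic: the seed data is fixed, so the two totals must match exactly.
  return raw.total === rolled.total;
}
```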
4.2 Execution Protocol and Infrastructure
A run pairs one agent, one tooling harness, and one prompt variant against one scenario. The agent operates autonomously: it may invoke tools, write code, query databases, and iterate, but there is no conversational back-and-forth with a human. All scenarios run against real, fully containerized infrastructure: Postgres, ClickHouse, and Redpanda, plus whatever the tooling harness adds.
Every run produces a full structured trace: each reasoning step, tool call (shell commands, file edits, SQL queries), tool result, token count, and wall-clock timing is recorded and persisted alongside the scored result. These traces are the primary artifact for comparing how different agents approach the same task.
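A sketch of what one trace record might contain, assuming a flat sequence of step records per run (field names are illustrative, not the exact persisted schema):

```typescript
// One step in a run trace. Field names are illustrative.
interface TraceStep {
  runId: string;
  index: number;                                   // position within the run
  kind: "reasoning" | "tool_call" | "tool_result";
  tool?: "shell" | "file_edit" | "sql";            // set for tool calls and results
  payload: string;                                 // command, diff, query, or model text
  tokens: { input: number; output: number };
  startedAt: string;                               // ISO-8601 timestamp
  durationMs: number;                              // wall-clock duration of the step
}
```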
4.3 Evaluation Procedure
Gates are evaluated sequentially. Each gate contains two kinds of assertions: core checks shared across all scenarios (e.g. clean process exit, no credentials in committed code) that must all pass, and scenario-specific checks (e.g. query returns correct aggregates, data flows end-to-end across systems) graded as a group with a pass threshold of 80%. A gate clears only when both conditions are met.
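In code, the clearing rule for a single gate looks roughly like this. A minimal sketch; the harness's own check representation will differ.

```typescript
interface GateChecks {
  core: boolean[];      // shared checks: every one must pass
  scenario: boolean[];  // scenario-specific checks: graded as a group
}

const SCENARIO_PASS_THRESHOLD = 0.8;

function passRate(checks: boolean[]): number {
  return checks.length === 0 ? 1 : checks.filter(Boolean).length / checks.length;
}

// A gate clears only when every core check passes AND the scenario-check
// pass rate meets the 80% threshold.
function gateClears(gate: GateChecks): boolean {
  return gate.core.every(Boolean) && passRate(gate.scenario) >= SCENARIO_PASS_THRESHOLD;
}
```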
4.4 Scoring Function
The normalized score is a step function with partial credit at the failure boundary. Fully cleared gates contribute equally; the first failed gate contributes its scenario-check pass rate as a fraction of one gate. The result is scaled to 0–1, so an agent that clears three of five gates and passes 60% of the fourth scores 0.72.
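A sketch of that scoring rule, assuming the score is computed from the cleared-gate count and the scenario-check pass rate at the first failure (details such as rounding may differ in the harness):

```typescript
// Normalized score: each fully cleared gate is worth 1/5; the first failed
// gate contributes its scenario-check pass rate as a fraction of one gate.
function normalizedScore(clearedGates: number, firstFailedPassRate = 0): number {
  const TOTAL_GATES = 5;
  if (clearedGates >= TOTAL_GATES) return 1;
  return (clearedGates + firstFailedPassRate) / TOTAL_GATES;
}

// Example from the text: three gates cleared, 60% of the fourth gate's
// scenario checks passing → (3 + 0.6) / 5 = 0.72.
normalizedScore(3, 0.6);
```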
§5 Comparative Results
This section presents the benchmark results in three parts. First, how agents perform across the five quality gates. Then, how quality trades off against cost and time. Finally, scenario-level detail.
5.1 Task Completion
Attrition through the first four gates is gradual: 90% of runs clear G1 (functional), tapering to 78% at G4 (performant). The production gate is a cliff: only 15% of runs clear G5. Agent spreads at each gate are modest (7–14pp) and mostly not statistically significant overall, but on tier-2 scenarios agent choice becomes significant at early gates (p = 0.03). Within Claude, model choice (Opus 4.6 vs Sonnet 4.6) shows no meaningful difference at any gate (p > 0.3), with identical median normalized scores.
Interpretation
Of the 98 runs that reach G5, the single biggest failure is uses_env_vars: 67% hardcode database connection strings instead of reading from environment variables. Deep nesting (14%) and leftover debug artifacts (8%) account for most of the rest. The pattern is consistent: agents produce functional code that is brittle across environments and harder to maintain.
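The pattern behind the uses_env_vars failures is small but consequential. A hedged example of the difference the check is looking for (names and values are illustrative):

```typescript
// What failing runs do: connection details hardcoded into committed source.
const dbHardcoded = "postgres://app:s3cret@localhost:5432/analytics";

// What the check expects: configuration read from the environment, so the
// same code runs unchanged across local, CI, and production.
const dbUrl = process.env.DATABASE_URL;
if (!dbUrl) throw new Error("DATABASE_URL is not set");
```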
5.2 Harness Lift
The classic-de harness does not produce a statistically significant improvement in score when controlling for agent (stratified p = 0.08). Per-agent results diverge: Claude Code shows a significant score lift (0.973 to 0.987 median, p = 0.003) at higher cost ($0.25 to $0.32, p = 0.01). Codex trends similarly (0.973 to 0.987 median) with a large cost increase ($2.69 to $4.11) though neither reaches significance. Cursor moves in the opposite direction: slightly lower score (0.987 to 0.973) at slightly lower cost ($0.20 to $0.18), neither significant. Time is unaffected across all agents.
Interpretation
The harness helps Claude Code and Codex but slightly hurts Cursor’s performance. In both directions the effect is small. The cost story mirrors the score story: agents that improve with the harness also spend more tokens using it, while Cursor spends slightly fewer.
5.3 Cost and Efficiency
Three distinct price bands emerge: Cursor ($0.15–$0.20 median), Claude Code ($0.32–$0.45), and Codex ($2.71–$4.42). These bands hold across every gate level. Within each agent, harder scenarios cost 2–3× more (e.g. Claude Code: $0.24 at T1 vs $0.58 at T3; Codex: $2.57 vs $8.13). Time is tightly clustered: all agents take 2.0–2.8 minutes through G4 and 2.7–3.4 minutes at G5, a spread of only 1.3×. On T3, Claude Code scores highest (0.98) at $0.58, while Codex scores 0.58 at $8.13.
Interpretation
Spending more does not buy better results. The 18× cost spread between agents reflects token volume, not wall-clock effort or quality. Difficulty increases cost for every agent, but the between-agent price gap is the dominant factor. The scatter plot makes this visible: filtering by difficulty pushes all agents rightward (more expensive) and some downward (lower quality), but the three price-band clusters persist.
A cheap run that fails at G1 and an expensive run that clears G5 are not comparable. The table shows median cost and time only for runs that cleared at least a given gate, so each column compares like with like.
| Agent | All | G1+ | G4+ | G5+ |
|---|---|---|---|---|
| Codex | $3 / 2m 43s (n=29) | $3 / 2m 48s (n=24) | $3 / 2m 34s (n=21) | $4 / 3m 26s (n=6) |
| Claude Code | $0.31 / 2m 4s (n=61) | $0.31 / 1m 54s (n=55) | $0.32 / 1m 59s (n=48) | $0.45 / 2m 41s (n=8) |
| Cursor | $0.19 / 2m 12s (n=35) | $0.18 / 2m 15s (n=34) | $0.15 / 2m 9s (n=29) | $0.20 / 2m 47s (n=5) |

Each cell shows median cost / median time for runs that cleared at least that gate.
Each dot is one run. Score is the normalized DEC Bench score (0–1) on a log-compressed axis. Cost is LLM API spend (agent-reported or derived from published pricing); time is wall-clock duration.
§6 Limitations
[4] Significance tested with the Mann–Whitney U test (two-tailed), a non-parametric rank-sum test appropriate for small, non-normal samples.