§1 Five-Gate Evaluation Model
The evaluation model uses five sequential quality gates. Gates are strictly ordered: an agent must clear each gate before the next is evaluated. A solution that produces correct output but crashes on edge cases clears G2 (Correct) but fails at G3 (Robust). A solution that handles edge cases but misses latency targets clears G3 but fails at G4 (Performant).
This sequential structure is deliberate. Aggregate pass/fail scores hide where agents struggle. Gate attrition curves reveal the shape of quality: two agents with the same final score can have very different failure profiles.
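For orientation, the gate sequence can be written down as a small ordered list. This is an illustrative sketch, not the harness's actual definition; the G1 and G5 labels (functional, production) are taken from the results in §5.

```typescript
// The five quality gates in the strict order they are evaluated.
// Names are illustrative; the benchmark's own identifiers may differ.
const GATES = [
  "G1 Functional",  // runs end-to-end and produces output
  "G2 Correct",     // output matches the expected results
  "G3 Robust",      // survives edge cases without crashing
  "G4 Performant",  // meets latency and throughput targets
  "G5 Production",  // production-ready code (config via env vars, no debug artifacts)
] as const;

type Gate = (typeof GATES)[number];

// A run's gate progression: the gates cleared, in order, and where it stopped.
interface GateProgression {
  cleared: Gate[];   // a prefix of GATES
  failedAt?: Gate;   // first gate not cleared, if any
}
```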
§2 Evaluation Variables
Each benchmark run has three independent variables: the agent, the tooling harness, and the prompt variant. The full evaluation matrix captures every combination to isolate the effect of each. The dependent variables are quality and efficiency. Quality is captured as gate progression (which of the five gates the agent cleared) and normalized score. Efficiency is captured as token usage, LLM API cost, and wall-clock time. A sketch of a full run record follows the variable definitions below.
Agent:
The AI coding tool being evaluated. The current evaluation includes Claude Code (Opus 4.6, Sonnet 4.6), Codex (GPT-5.4), and Cursor (Composer 2).
Tooling Harness:
The tooling environment the agent works in. Three configurations are tested. Base infrastructure provides Postgres, ClickHouse, and Redpanda without additional tooling, measuring the agent’s first-principles reasoning. Classic DE adds dbt, Airflow, and Spark, representing traditional data engineering stacks. OLAP for SWE adds MooseStack with typed schemas and auto migrations, representing modern analytical tooling.
Prompt Variant:
How much domain knowledge the prompt provides. Each scenario has two conditions. The baseline prompt gives minimal context: no tool names and no implementation hints; the agent figures out the approach on its own. The informed prompt provides domain-specific guidance: it names tools, specifies targets, and sets technical constraints.
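Taken together, one run in the matrix can be described by a record like the one below. This is a sketch; the field names and string values are illustrative, not the harness's actual schema.

```typescript
// One cell of the evaluation matrix: agent × tooling harness × prompt variant,
// applied to a single scenario. Names are illustrative.
type Harness = "base" | "classic-de" | "olap-for-swe";
type PromptVariant = "baseline" | "informed";

interface RunConfig {
  agent: string;            // e.g. "claude-code-opus-4.6", "codex-gpt-5.4", "cursor-composer-2"
  harness: Harness;
  prompt: PromptVariant;
  scenario: string;         // scenario identifier
}

// Dependent variables recorded for the run.
interface RunOutcome {
  gatesCleared: number;     // 0–5
  normalizedScore: number;  // 0–1, see §4.4
  tokens: number;
  costUsd: number;          // LLM API spend
  wallClockMs: number;
}
```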
§3 Scenarios
A scenario is a self-contained data engineering task. Each scenario defines its own infrastructure, seed data, starting state, and deterministic assertions. Scenarios span the Foo Bar synthetic SaaS analytics domain, covering ingestion, transformation, query optimization, schema design, streaming pipelines, storage optimization, and cross-system reconciliation.
Each scenario is assigned a difficulty tier based on the scope of infrastructure, number of tasks, and depth of reasoning required.
Difficulty Tier Definitions
Example Scenarios
§4 Methodology
Each benchmark run is a controlled experiment: one agent attempts one scenario under fixed conditions, and every action it takes is recorded.
4.1 Scenario Definition
The agent starts inside an isolated Docker container with live infrastructure already running: databases with tables, streams with topics, seed data loaded. It receives a single natural-language prompt describing what to build or fix, plus a set of TypeScript assertion functions that define success.
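As an illustration of what those assertions look like, here is a minimal sketch of a scenario check. The `query` helper, table names, and column names are hypothetical; each scenario ships its own checks.

```typescript
// Hypothetical scenario check: the hourly revenue rollup must reconcile
// with the raw purchase events. `query` stands in for whatever database
// client the scenario provides.
type Query = (sql: string) => Promise<Array<Record<string, unknown>>>;

export async function assertRevenueRollupMatches(query: Query): Promise<boolean> {
  const [raw] = await query(
    "SELECT round(sum(amount), 2) AS total FROM events WHERE type = 'purchase'"
  );
  const [rolled] = await query(
    "SELECT round(sum(revenue), 2) AS total FROM revenue_hourly"
  );
  // Deterministic: the seed data is fixed, so the two totals must match exactly.
  return raw.total === rolled.total;
}
```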
4.2 Execution Protocol and Infrastructure
A run pairs one agent, one tooling harness, and one prompt variant against one scenario. The agent operates autonomously: it may invoke tools, write code, query databases, and iterate, but there is no conversational back-and-forth with a human. All scenarios run against real, fully containerized infrastructure: Postgres, ClickHouse, and Redpanda, plus whatever the tooling harness adds.
Every run produces a full structured trace: each reasoning step, tool call (shell commands, file edits, SQL queries), tool result, token count, and wall-clock timing is recorded and persisted alongside the scored result. These traces are the primary artifact for comparing how different agents approach the same task.
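A sketch of what one trace record might contain, assuming a flat sequence of step records per run (field names are illustrative, not the exact persisted schema):

```typescript
// One step in a run trace. Field names are illustrative.
interface TraceStep {
  runId: string;
  index: number;                                   // position within the run
  kind: "reasoning" | "tool_call" | "tool_result";
  tool?: "shell" | "file_edit" | "sql";            // set for tool calls and results
  payload: string;                                 // command, diff, query, or model text
  tokens: { input: number; output: number };
  startedAt: string;                               // ISO-8601 timestamp
  durationMs: number;                              // wall-clock duration of the step
}
```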
4.3 Evaluation Procedure
Gates are evaluated sequentially. Each gate contains two kinds of assertions: core checks shared across all scenarios (e.g. clean process exit, no credentials in committed code) that must all pass, and scenario-specific checks (e.g. query returns correct aggregates, data flows end-to-end across systems) graded as a group with a pass threshold of 80%. A gate clears only when both conditions are met.
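In code, the clearing rule for a single gate looks roughly like this. A minimal sketch; the harness's own check representation will differ.

```typescript
interface GateChecks {
  core: boolean[];      // shared checks: every one must pass
  scenario: boolean[];  // scenario-specific checks: graded as a group
}

const SCENARIO_PASS_THRESHOLD = 0.8;

function passRate(checks: boolean[]): number {
  return checks.length === 0 ? 1 : checks.filter(Boolean).length / checks.length;
}

// A gate clears only when every core check passes AND the scenario-check
// pass rate meets the 80% threshold.
function gateClears(gate: GateChecks): boolean {
  return gate.core.every(Boolean) && passRate(gate.scenario) >= SCENARIO_PASS_THRESHOLD;
}
```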
4.4 Scoring Function
The normalized score is a step function with partial credit at the failure boundary. Fully cleared gates contribute equally; the first failed gate contributes its scenario-check pass rate as a fraction of one gate. The result is scaled to 0–1, so an agent that clears three of five gates and passes 60% of the fourth scores 0.72.
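A sketch of that scoring rule, assuming the score is computed from the cleared-gate count and the scenario-check pass rate at the first failure (details such as rounding may differ in the harness):

```typescript
// Normalized score: each fully cleared gate is worth 1/5; the first failed
// gate contributes its scenario-check pass rate as a fraction of one gate.
function normalizedScore(clearedGates: number, firstFailedPassRate = 0): number {
  const TOTAL_GATES = 5;
  if (clearedGates >= TOTAL_GATES) return 1;
  return (clearedGates + firstFailedPassRate) / TOTAL_GATES;
}

// Example from the text: three gates cleared, 60% of the fourth gate's
// scenario checks passing → (3 + 0.6) / 5 = 0.72.
normalizedScore(3, 0.6);
```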
§5 Comparative Results
This section presents the benchmark results in three parts. First, how agents perform across the five quality gates. Then, how quality trades off against cost and time. Finally, scenario-level detail.
5.1 Task Completion
Attrition through the first four gates is gradual: 90% of runs clear G1 (functional), tapering to 78% at G4 (performant). The production gate is a cliff: only 15% of runs clear G5. Agent spreads at each gate are modest (7–14pp) and mostly not statistically significant overall, but on tier-2 scenarios agent choice becomes significant at early gates (p = 0.03). Within Claude, model choice (Opus 4.6 vs Sonnet 4.6) shows no meaningful difference at any gate (p > 0.3), with identical median normalized scores.
Interpretation
Of the 98 runs that reach G5, the single biggest failure is uses_env_vars: 67% hardcode database connection strings instead of reading from environment variables. Deep nesting (14%) and leftover debug artifacts (8%) account for most of the rest. The pattern is consistent: agents produce functional code that is brittle across environments and harder to maintain.
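The pattern behind the uses_env_vars failures is small but consequential. A hedged example of the difference the check is looking for (names and values are illustrative):

```typescript
// What failing runs do: connection details hardcoded into committed source.
const dbHardcoded = "postgres://app:s3cret@localhost:5432/analytics";

// What the check expects: configuration read from the environment, so the
// same code runs unchanged across local, CI, and production.
const dbUrl = process.env.DATABASE_URL;
if (!dbUrl) throw new Error("DATABASE_URL is not set");
```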
5.2 Harness Lift
The classic-de harness does not produce a statistically significant improvement in score when controlling for agent (stratified p = 0.08). Per-agent results diverge: Claude Code shows a significant score lift (0.973 to 0.987 median, p = 0.003) at higher cost ($0.25 to $0.32, p = 0.01). Codex trends similarly (0.973 to 0.987 median) with a large cost increase ($2.69 to $4.11) though neither reaches significance. Cursor moves in the opposite direction: slightly lower score (0.987 to 0.973) at slightly lower cost ($0.20 to $0.18), neither significant. Time is unaffected across all agents.
Interpretation
The harness helps Claude Code and Codex but slightly hurts Cursor’s performance. In both directions the effect is small. The cost story mirrors the score story: agents that improve with the harness also spend more tokens using it, while Cursor spends slightly fewer.
5.3 Cost and Efficiency
Three distinct price bands emerge: Cursor ($0.15–$0.20 median), Claude Code ($0.32–$0.45), and Codex ($2.71–$4.42). These bands hold across every gate level. Within each agent, harder scenarios cost 2–3× more (e.g. Claude Code: $0.24 at T1 vs $0.58 at T3; Codex: $2.57 vs $8.13). Time is tightly clustered: all agents take 2.0–2.8 minutes through G4 and 2.7–3.4 minutes at G5, a spread of only 1.3×. On T3, Claude Code scores highest (0.98) at $0.58, while Codex scores 0.58 at $8.13.
Interpretation
Spending more does not buy better results. The 18× cost spread between agents reflects token volume, not wall-clock effort or quality. Difficulty increases cost for every agent, but the between-agent price gap is the dominant factor. The scatter plot makes this visible: filtering by difficulty pushes all agents rightward (more expensive) and some downward (lower quality), but the three price-band clusters persist.
A cheap run that fails at G1 and an expensive run that clears G5 are not comparable. The table shows median cost and time only for runs that cleared at least a given gate, so each column compares like with like.
| Agent | All | G1+ | G4+ | G5+ |
|---|---|---|---|---|
| Codex | $3 / 2m 43s (n=29) | $3 / 2m 48s (n=24) | $3 / 2m 34s (n=21) | $4 / 3m 26s (n=6) |
| Claude Code | $0.31 / 2m 4s (n=61) | $0.31 / 1m 54s (n=55) | $0.32 / 1m 59s (n=48) | $0.45 / 2m 41s (n=8) |
| Cursor | $0.19 / 2m 12s (n=35) | $0.18 / 2m 15s (n=34) | $0.15 / 2m 9s (n=29) | $0.20 / 2m 47s (n=5) |

Each cell shows median cost / median time for runs that cleared at least that gate.
Each dot is one run. Score is the normalized DEC Bench score (0–1) on a log-compressed axis. Cost is LLM API spend (agent-reported or derived from published pricing); time is wall-clock duration.
§6 Limitations
[4] Significance tested with the Mann–Whitney U test (two-tailed), a non-parametric rank-sum test appropriate for small, non-normal samples.