Research Preview

DEC Bench: A multi-gate evaluation framework for AI coding agents on data engineering tasks.

LAST UPDATED: March 2026
VERSION: 0.1-preview
LICENSE: MIT
SCENARIOS: 37
HARNESSES: 3
AGENTS: 3
Abstract

Benchmarks like SWE-bench[1] have set the standard for evaluating AI coding agents on software engineering tasks: patch generation, bug resolution, and repository-level reasoning. Early work on data engineering evaluation, including DE-Bench[2] and dbt Labs’ skill-eval[3], has begun to extend this coverage to schema design and transformation tasks. But data engineering requires a broader set of competencies: pipeline orchestration, streaming integration, query optimization, and data quality validation, often across multiple database systems within a single workflow.

DEC Bench is an open evaluation framework that scores AI coding agents on realistic data engineering scenarios. Each scenario runs against real infrastructure: OLTP (Postgres), OLAP (ClickHouse), and streaming (Redpanda), all inside isolated Docker containers, and agent output is validated against five sequential quality gates. The gates are ordered by increasing rigor: from “the code runs” to “you would ship this.” This sequential structure produces gate attrition curves that expose not just whether an agent succeeds, but where and how it falls off.

The benchmark is designed as an experiment with three independent variables (the agent, the tooling harness, and the prompt variant) and two dependent variables (quality and efficiency).

Across 125 preliminary runs, three findings stand out. First, agent choice only becomes statistically significant on hard scenarios, where the highest- and lowest-scoring agents differ by 56 percentage points (p < 0.01)[4]. Second, the basic tooling harness tested so far does not show a statistically significant effect (p = 0.08)[4]. Third, while 90% of runs clear the first quality gate, the highest-scoring agent still only clears the hardest gate 21% of the time.

These results are preliminary: sample sizes are small, only one harness configuration has been tested, prompt personas beyond the baseline have not yet been evaluated, and agent capabilities shift with each model update. See Limitations for details.

Contents

§1 Five-Gate Evaluation Model
§2 Evaluation Variables
§3 Scenarios
§4 Methodology
§5 Comparative Results
§6 Limitations
§7 Evaluation Access

§1 Five-Gate Evaluation Model

The evaluation model uses five sequential quality gates. Gates are strictly ordered: an agent must clear each gate before the next is evaluated. A solution that produces correct output but crashes on edge cases clears G2 (Correct) but fails at G3 (Robust). A solution that handles edge cases but misses latency targets clears G3 but fails at G4 (Performant).

This sequential structure is deliberate. Aggregate pass/fail scores hide where agents struggle. Gate attrition curves reveal the shape of quality: two agents with the same final score can have very different failure profiles.

Note

Gates are TypeScript assertion functions defined per scenario. A gate clears when all core checks pass and at least 80% of scenario checks pass. Scores combine cleared gates with partial credit on the first failure.

Gate 01 FUNCTIONAL: The code runs without errors
Gate 02 CORRECT: It produces expected output
Gate 03 ROBUST: It handles edge cases and error conditions
Gate 04 PERFORMANT: It meets latency and throughput targets
Gate 05 PRODUCTION: Code quality, documentation, and operational readiness are fit for release
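
To make the gate mechanics concrete, here is a minimal sketch of how per-scenario checks could be expressed as TypeScript assertion functions, following the clearing rule in the note above (all core checks pass, at least 80% of scenario checks pass). The type names, the query helper, and the specific SQL are illustrative assumptions, not DEC Bench's actual interfaces.

```typescript
// Hypothetical gate-check shape; names and helpers are illustrative only.
type QueryFn = (sql: string) => Promise<Record<string, unknown>[]>;

interface GateCheck {
  name: string;
  core: boolean;                              // core checks must all pass
  run: (query: QueryFn) => Promise<boolean>;  // scenario checks need >= 80% passing
}

// Example G2 (Correct) checks for a CSV-ingest-style scenario.
const correctGate: GateCheck[] = [
  {
    name: "events_table_exists",
    core: true,
    run: async (query) => {
      const rows = await query(
        "SELECT count() AS n FROM system.tables WHERE database = currentDatabase() AND name = 'events'"
      );
      return Number(rows[0].n) === 1;
    },
  },
  {
    name: "row_count_matches_source",
    core: false,
    run: async (query) => {
      const rows = await query("SELECT count() AS n FROM events");
      return Number(rows[0].n) === 10_000;    // assumed seed-data row count
    },
  },
];
```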

§2 Evaluation Variables

Each benchmark run has three independent variables: the agent, the tooling harness, and the prompt variant. The full evaluation matrix captures every combination to isolate the effect of each. The dependent variables are quality and efficiency. Quality is captured as gate progression (which of the five gates the agent cleared) and normalized score. Efficiency is captured as token usage, LLM API cost, and time taken.

01. Agent: The AI coding tool being evaluated. The current evaluation includes Claude Code (Opus 4.6, Sonnet 4.6), Codex (GPT-5.4), and Cursor (Composer 2).

02. Tooling Harness: The tooling environment the agent works in. Three configurations are tested. Base infrastructure provides Postgres, ClickHouse, and Redpanda without additional tooling, measuring the agent’s first-principles reasoning. Classic DE adds dbt, Airflow, and Spark, representing traditional data engineering stacks. OLAP for SWE adds MooseStack with typed schemas and auto migrations, representing modern analytical tooling.

03. Prompt Variant: How much domain knowledge the prompt provides. Each scenario has two conditions. The baseline prompt gives minimal context: no tool names, no implementation hints; the agent figures out the approach on its own. The informed prompt provides domain-specific guidance: it names tools, specifies targets, and sets technical constraints.

Note

0.1-preview includes only a basic tooling harness and baseline prompts. Integrated harnesses and informed prompt variants are planned for future releases.

§3 Scenarios

A scenario is a self-contained data engineering task. Each scenario defines its own infrastructure, seed data, starting state, and deterministic assertions. Scenarios span the Foo Bar synthetic SaaS analytics domain, covering ingestion, transformation, query optimization, schema design, streaming pipelines, storage optimization, and cross-system reconciliation.

Each scenario is assigned a difficulty tier based on the scope of infrastructure, number of tasks, and depth of reasoning required.

Difficulty Tier Definitions

Note

The full benchmark includes 38 scenarios: 14 Tier 1, 19 Tier 2, and 5 Tier 3.

Difficulty Tier 1
FOCUSED: The agent diagnoses and fixes a single, narrowly scoped problem. 1–2 services of: Postgres, ClickHouse.
  • 14 scenarios, 3–5 assertions per gate
  • e.g. Broken Connection, CSV Ingest, Slow Queries, ORDER BY Optimization
Difficulty Tier 2
MODERATE: The agent must make meaningful architectural or design decisions. 1–2 services of: Postgres, ClickHouse, Redpanda.
  • 19 scenarios, 5–10 assertions per gate
  • e.g. Stream to OLAP, Schema Evolution, Idempotent Pipeline
Difficulty Tier 3
COMPLEX: Multiple interacting failure modes that demand the agent understand how systems compose. 2–3+ services of: Postgres, Redpanda, ClickHouse.
  • 5 scenarios, 10+ assertions per gate
  • e.g. Full Pipeline Debug, Cross-System Reconciliation, OLTP to OLAP Migration

Example Scenarios

#  Scenario                      Tier  Services                         Category
1  Broken Connection             T1    Postgres                         Debugging
2  CSV Ingest                    T1    ClickHouse                       Ingestion
3  Stream to OLAP                T2    Redpanda, ClickHouse             Ingestion
4  Cross-System Reconciliation   T3    Postgres, Redpanda, ClickHouse   Debugging

§4 Methodology

Each benchmark run is a controlled experiment: one agent attempts one scenario under fixed conditions, and every action it takes is recorded.

4.1 Scenario Definition

Each scenario is a self-contained data engineering task. The agent starts inside an isolated Docker container with live infrastructure already running: databases with tables, streams with topics, seed data loaded. It receives a single natural-language prompt describing what to build or fix, and a set of TypeScript assertion functions that define success.
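
As a rough illustration of what such a scenario bundle might look like in code, the sketch below collects infrastructure, seed data, prompt, and per-gate assertions into one definition. The field names, file path, and example prompt are assumptions for illustration, not the benchmark's actual schema.

```typescript
// Hypothetical scenario definition; field names and values are illustrative.
interface ScenarioDefinition {
  id: string;                                        // e.g. "csv-ingest"
  tier: 1 | 2 | 3;                                   // difficulty tier
  services: ("postgres" | "clickhouse" | "redpanda")[];
  seedScripts: string[];                             // loaded before the agent starts
  prompt: string;                                    // natural-language task description
  gates: Record<"G1" | "G2" | "G3" | "G4" | "G5",
    Array<() => Promise<boolean>>>;                  // assertion functions per gate (simplified)
}

const csvIngest: ScenarioDefinition = {
  id: "csv-ingest",
  tier: 1,
  services: ["clickhouse"],
  seedScripts: ["seed/events.csv.sql"],              // hypothetical path
  prompt: "Load the provided CSV export into ClickHouse and expose it as an events table.",
  gates: { G1: [], G2: [], G3: [], G4: [], G5: [] }, // checks omitted for brevity
};
```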

4.2 Execution Protocol and Infrastructure

A run pairs one agent, one tooling harness, and one prompt variant against one scenario. The agent operates autonomously: it may invoke tools, write code, query databases, and iterate, but there is no conversational back-and-forth with a human. All scenarios run against real, fully containerized infrastructure:

POSTGRES

Transactional source of truth. Schema migrations, foreign keys, constraints.

CLICKHOUSE

Columnar analytics engine. Materialized views, partition keys, ORDER BY optimization.

REDPANDA

Kafka-compatible event streaming. Topics, consumers, partitions.

Every run produces a full structured trace: each reasoning step, tool call (shell commands, file edits, SQL queries), tool result, token count, and wall-clock timing is recorded and persisted alongside the scored result. These traces are the primary artifact for comparing how different agents approach the same task.
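
A plausible shape for these trace records, with field names mirroring what the text says is captured (reasoning steps, tool calls and results, token counts, timing); the exact persisted schema is an assumption.

```typescript
// Hypothetical trace record types; the real persisted schema may differ.
type TraceEvent =
  | { kind: "reasoning"; text: string; tokens: number; elapsedMs: number }
  | { kind: "tool_call"; tool: "shell" | "file_edit" | "sql"; input: string;
      tokens: number; elapsedMs: number }
  | { kind: "tool_result"; tool: "shell" | "file_edit" | "sql"; output: string;
      elapsedMs: number };

interface RunTrace {
  agent: string;               // e.g. "claude-code"
  scenario: string;            // e.g. "stream-to-olap"
  events: TraceEvent[];        // full ordered trace of the run
  totalTokens: number;
  costUsd: number;             // agent-reported or derived from published pricing
  wallClockMs: number;
  score: number;               // normalized 0-1 score after gating
}
```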

4.3 Evaluation Procedure

Gates are evaluated sequentially. Each gate contains two kinds of assertions: core checks shared across all scenarios (e.g. clean process exit, no credentials in committed code) that must all pass, and scenario-specific checks (e.g. query returns correct aggregates, data flows end-to-end across systems) graded as a group with a pass threshold of 80%. A gate clears only when both conditions are met.
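
Expressed as code, the clearing rule is simple; this is a minimal sketch of the logic described above, not the benchmark's actual implementation.

```typescript
// Minimal sketch of the gate-clearing rule: every core check passes AND at
// least 80% of scenario-specific checks pass.
function gateCleared(coreResults: boolean[], scenarioResults: boolean[]): boolean {
  const allCorePass = coreResults.every(Boolean);
  const scenarioPassRate = scenarioResults.length === 0
    ? 1
    : scenarioResults.filter(Boolean).length / scenarioResults.length;
  return allCorePass && scenarioPassRate >= 0.8;
}
```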

4.4 Scoring Function

The normalized score is a step function with partial credit at the failure boundary. Fully cleared gates contribute equally; the first failed gate contributes its scenario-check pass rate as a fraction of one gate. The result is scaled to 0–1, so an agent that clears three of five gates and passes 60% of the fourth scores 0.72.
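
A minimal sketch of that scoring rule, assuming five equally weighted gates and partial credit only at the first failed gate; the worked example from the text gives 0.72.

```typescript
// Sketch of the normalized score: each cleared gate contributes 1/5, and the
// first failed gate contributes its scenario-check pass rate as a fraction of
// one gate. (Illustrative; the real scoring code may differ.)
function normalizedScore(gatesCleared: number, firstFailurePassRate: number): number {
  const TOTAL_GATES = 5;
  return (gatesCleared + firstFailurePassRate) / TOTAL_GATES;
}

// Three gates cleared, 60% of the fourth gate's scenario checks passing:
normalizedScore(3, 0.6); // (3 + 0.6) / 5 = 0.72
```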

§5 Comparative Results

This section presents the benchmark results in three parts. First, how agents perform across the five quality gates. Then, how quality trades off against cost and time. Finally, scenario-level detail.

5.1 Task Completion

Attrition through the first four gates is gradual: 90% of runs clear G1 (functional), tapering to 78% at G4 (performant). The production gate is a cliff: only 15% of runs clear G5. Agent spreads at each gate are modest (7–14pp) and mostly not statistically significant overall, but on tier-2 scenarios agent choice becomes significant at early gates (p = 0.03). Within Claude, model choice (Opus 4.6 vs Sonnet 4.6) shows no meaningful difference at any gate (p > 0.3), with identical median normalized scores.

Interpretation

Of the 98 runs that reach G5, the single biggest failure is uses_env_vars: 67% hardcode database connection strings instead of reading from environment variables. Deep nesting (14%) and leftover debug artifacts (8%) account for most of the rest. The pattern is consistent: agents produce functional code that is brittle across environments and harder to maintain.
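
For illustration, the contrast the uses_env_vars check is getting at looks roughly like this; the variable name DATABASE_URL and the connection string are made up for the example.

```typescript
// Brittle: connection details hardcoded into the pipeline (the pattern behind
// most uses_env_vars failures at G5).
const hardcoded = "postgres://app:secret@localhost:5432/foobar";

// Portable: read configuration from the environment instead.
const fromEnv = process.env.DATABASE_URL;
if (!fromEnv) {
  throw new Error("DATABASE_URL is not set");
}
```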

[Chart] Gate Attrition by Agent: for each gate, the percentage of that agent’s runs that cleared that gate level, filterable by scenario difficulty (T1 easiest, T3 hardest). Claude Code n=61, Codex n=29, Cursor n=35.
Note

Computed over 125 runs across 32 scenarios and 3 agents. Both harness configurations (base-rt and classic-de) are combined.

5.2 Harness Lift

The classic-de harness does not produce a statistically significant improvement in score when controlling for agent (stratified p = 0.08). Per-agent results diverge: Claude Code shows a significant score lift (0.973 to 0.987 median, p = 0.003) at higher cost ($0.25 to $0.32, p = 0.01). Codex trends similarly (0.973 to 0.987 median) with a large cost increase ($2.69 to $4.11) though neither reaches significance. Cursor moves in the opposite direction: slightly lower score (0.987 to 0.973) at slightly lower cost ($0.20 to $0.18), neither significant. Time is unaffected across all agents.

Interpretation

The harness helps Claude Code and Codex but slightly hurts Cursor’s performance. In both directions the effect is small. The cost story mirrors the score story: agents that improve with the harness also spend more tokens using it, while Cursor spends slightly fewer.

[Chart] Harness Lift by Agent: median normalized score and median cost/time per agent under the base-rt and classic-de harnesses; arrows show the shift from base-rt to classic-de, filterable by scenario difficulty.

Agent        base-rt        classic-de     Score shift
Claude Code  0.97 (n=23)    0.99 (n=38)    +1%
Codex        0.97 (n=11)    0.99 (n=18)    +1%
Cursor       0.99 (n=15)    0.97 (n=20)    -1%
Note

In 0.1-preview, harness coverage is uneven across tiers: T1 scenarios were only run with base-rt, T3 only with classic-de. Only T2 and the pooled view have both harnesses represented, and even there base-rt sample sizes are small (1–2 runs per agent). Balancing coverage is a priority for the next release.

5.3 Cost and Efficiency

Three distinct price bands emerge: Cursor ($0.15–$0.20 median), Claude Code ($0.32–$0.45), and Codex ($2.71–$4.42). These bands hold across every gate level. Within each agent, harder scenarios cost 2–3× more (e.g. Claude Code: $0.24 at T1 vs $0.58 at T3; Codex: $2.57 vs $8.13). Time is tightly clustered: all agents take 2.0–2.8 minutes through G4 and 2.7–3.4 minutes at G5, a spread of only 1.3×. On T3, Claude Code scores highest (0.98) at $0.58, while Codex scores 0.58 at $8.13.

Interpretation

Spending more does not buy better results. The 18× cost spread between agents reflects token volume, not wall-clock effort or quality. Difficulty increases cost for every agent, but the between-agent price gap is the dominant factor. The scatter plot makes this visible: filtering by difficulty pushes all agents rightward (more expensive) and some downward (lower quality), but the three price-band clusters persist.

A cheap run that fails at G1 and an expensive run that clears G5 are not comparable. The table shows median cost and time only for runs that cleared at least a given gate, so each column compares like with like.

Agent        All              G1+              G4+              G5+
Codex        $3 (n=29)        $3 (n=24)        $3 (n=21)        $4 (n=6)
             2m 43s           2m 48s           2m 34s           3m 26s
Claude Code  $0.31 (n=61)     $0.31 (n=55)     $0.32 (n=48)     $0.45 (n=8)
             2m 4s            1m 54s           1m 59s           2m 41s
Cursor       $0.19 (n=35)     $0.18 (n=34)     $0.15 (n=29)     $0.20 (n=5)
             2m 12s           2m 15s           2m 9s            2m 47s
Quality vs. Efficiency

Each dot is one run. Score is the normalized DEC Bench score (0–1) on a log-compressed axis. Cost is LLM API spend (agent-reported or derived from published pricing); time is wall-clock duration.

[Chart] Score vs. cost or time, filterable by scenario difficulty. Claude Code n=59, Codex n=29, Cursor n=35.
Note

The y-axis is log-compressed near 1.0 to visually separate runs that cleared G4 (score 0.91–0.99) from those that cleared G5 (score 1.0). Top-left is high quality at low cost/time.

§6 Limitations

6.1 SAMPLE SIZE

125 total runs across three agents, with 29–61 runs per agent. Tier-level breakdowns are smaller: T3 (hard) has only 2–6 runs per agent, limiting the confidence of difficulty-stratified findings. The statistical claims in this release (e.g. agent significance on hard scenarios) should be treated as directional.

6.2 HARNESS AND PROMPT COVERAGE

Only one harness pair has been tested (base-rt vs classic-de), and harness coverage is uneven (76 vs 49 runs). No prompt variant beyond the baseline has been evaluated. The finding that harness lift is not significant may change with more integrated tooling or informed prompts.

6.3 AGENT SELECTION

Three agents were tested: Claude Code, Codex, and Cursor. The cost–quality relationship observed (spending more does not buy better results) is based on this specific set and may not generalize to other agents or pricing models.

6.4 VERSION SENSITIVITY

Agent capabilities change with each model update. Results reflect the specific model versions tested (e.g. Claude Opus 4.6, Sonnet 4.6) and may not hold for future releases.

6.5 GATE BOUNDARY SUBJECTIVITY

The five-gate model imposes discrete quality levels. Performance within a gate is not captured by the gate label alone. The normalized score provides finer resolution but still compresses scenario-specific nuance into a single number.

6.6 DOMAIN COVERAGE

36 of 37 scenarios use the Foo Bar synthetic domain; one uses an e-commerce domain. Real-world data engineering spans a wider range of systems, schemas, and failure modes than is currently represented.

§7 Evaluation Access

OPEN BENCHMARK

Reproduce Our Results

DEC Bench is open source and fully containerized. Clone the repository, run the evaluation suite against your preferred agent, and reproduce every result reported here.

RESEARCH PREVIEW

Contribute to the Benchmark

We invite contributions across three dimensions: running the evaluation against additional agents, developing new scenarios, and extending the methodology to adjacent domains.

References

[1] Jimenez, C.E., et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? 2024.

[2] Ardent AI Labs. DE-Bench: A Benchmark for Data Engineering Tasks. 2025.

[3] dbt Labs. skill-eval: Evaluating LLM competency on dbt tasks. 2025.

[4] Significance tested with the Mann–Whitney U test (two-tailed), a non-parametric rank-sum test appropriate for small, non-normal samples.