DEC BENCH

An open-source data engineering competency benchmark for evaluating an AI agent's ability to tackle real-world data problems.

POSTGRES · REDPANDA · CLICKHOUSE · FOO BAR · OPEN SOURCE · LEADERBOARD · SCHEMA DESIGN · QUERY OPTIMIZATION · DATA INGESTION · CSV INGEST · TABLE LAYOUT · DEBUGGING · DATA QUALITY
EVALUATION RUN SCORING METHODOLOGY · 5 SEQUENTIAL GATES
Pass each gate in order

Each gate is passed if all core and scenario assertions pass. Quantitative metrics then rank evals within each gate.

01 · FUNCTIONAL GATE · IT RUNS
02 · CORRECT GATE · IT PRODUCES RIGHT ANSWERS
03 · ROBUST GATE · IT HANDLES REAL-WORLD CONDITIONS
04 · PERFORMANT GATE · IT'S FAST ENOUGH
05 · PRODUCTION GATE · YOU'D SHIP THIS
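
The sequential gating described above can be sketched in Python. This is a minimal illustration, not DEC Bench's actual scorer: the `GateResult` type, the `highest_gate_passed` function, and the assertion names are hypothetical. Only the five gate names, their order, and the rule that a gate passes when all of its core and scenario assertions pass come from the text.

```python
from dataclasses import dataclass, field

# Gate order as listed above; the lowercase identifiers are an assumption.
GATES = ["functional", "correct", "robust", "performant", "production"]


@dataclass
class GateResult:
    """Assertion outcomes for one gate (hypothetical structure)."""
    core: dict = field(default_factory=dict)      # assertion name -> passed?
    scenario: dict = field(default_factory=dict)  # assertion name -> passed?

    def passed(self) -> bool:
        # A gate passes only if every core and scenario assertion passes.
        # Note: a gate with no assertions at all passes vacuously.
        return all(self.core.values()) and all(self.scenario.values())


def highest_gate_passed(results: dict) -> int:
    """Return how many gates were passed in order (0 to 5).

    Evaluation stops at the first failing or missing gate, so a later
    gate cannot count if an earlier one failed.
    """
    count = 0
    for name in GATES:
        result = results.get(name)
        if result is None or not result.passed():
            break
        count += 1
    return count


if __name__ == "__main__":
    run = {
        "functional": GateResult(core={"container_starts": True}),
        "correct": GateResult(
            core={"row_count_matches": True},
            scenario={"dates_parsed": False},  # one failing assertion
        ),
        "robust": GateResult(core={"survives_rerun": True}),
    }
    # The run stops at gate 02, even though gate 03 would have passed.
    print(highest_gate_passed(run))
```

Ranking within a gate by quantitative metrics is left out here because the text does not say which metrics are used or in which direction they sort.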

SCENARIOS

Nine real-world data engineering challenges across ingestion, schema design, query optimization, debugging, and end-to-end pipeline construction.

FOO BAR DOMAIN · 9 SCENARIOS

CSV INGEST

BUILD

Load five messy CSV files into clean ClickHouse tables. Handle inconsistent dates, mixed null representations, duplicate headers, and trailing delimiters.

INGESTION · clickhouse
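
To make the four CSV problems concrete, here is a minimal cleaning sketch using only the Python standard library. It is not the benchmark's reference solution: the null spellings in `NULL_TOKENS`, the date formats in `DATE_FORMATS`, and the function names are all assumptions chosen for illustration, and the real files may need different rules. Loading the cleaned rows into ClickHouse is deliberately left out.

```python
import csv
import io
from datetime import datetime

# Assumed spellings of "no value"; the real files may use others.
NULL_TOKENS = {"", "null", "none", "n/a", "na", "nan", "-"}
# Assumed date layouts, tried in order. Ambiguous layouts such as
# day-first vs month-first with the same separator need a per-file decision.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%Y%m%d")


def normalize_null(value):
    """Map any assumed null spelling to None, otherwise return the stripped text."""
    stripped = value.strip()
    return None if stripped.lower() in NULL_TOKENS else stripped


def parse_date(value):
    """Return an ISO date string, or None if no assumed format matches."""
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_csv(text, date_columns=()):
    """Parse CSV text into a list of dicts with the four problems handled."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header = [h.strip() for h in rows[0]]
    # A trailing delimiter shows up as an empty last header cell.
    while header and header[-1] == "":
        header.pop()
    width = len(header)
    cleaned = []
    for raw in rows[1:]:
        if not raw:  # blank line
            continue
        cells = [c.strip() for c in raw[:width]]  # drop trailing-delimiter cell
        if cells == header:  # header repeated mid-file
            continue
        cells += [""] * (width - len(cells))  # pad short rows
        record = {name: normalize_null(cell) for name, cell in zip(header, cells)}
        for col in date_columns:
            if col in record:
                record[col] = parse_date(record[col])
        cleaned.append(record)
    return cleaned
```

Unparseable dates become `None` in this sketch; a stricter pipeline might instead route such rows to a reject table so the loss is visible.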

TABLE LAYOUT

BUILD

Redesign a naive ClickHouse table with 5M rows to serve three representative query patterns. Set partition keys, order keys, and compression for target latencies.

SCHEMA DESIGN · clickhouse

SLOW QUERIES

FIX

Rewrite five slow analytical queries against a 10M-row ClickHouse table to run under latency thresholds without changing result sets.

QUERY OPTIMIZATION · clickhouse

BROKEN CONNECTION

FIX

Diagnose and fix a misconfigured Postgres setup: wrong connection string, partial init script, and hardcoded credentials in a Python script.

DEBUGGING · postgres
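
One common fix for hardcoded credentials is to read them from the environment. The sketch below assumes the script can use the standard libpq variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`); the `postgres_dsn` helper itself is hypothetical, and the scenario's actual script may be structured differently.

```python
import os
from urllib.parse import quote


def postgres_dsn(env=None):
    """Build a Postgres connection URI from environment-style settings.

    `env` defaults to os.environ; passing a dict makes the function testable.
    """
    env = os.environ if env is None else env
    required = ("PGUSER", "PGPASSWORD", "PGDATABASE")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail loudly instead of silently falling back to a baked-in secret.
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    # Percent-encode parts that may contain URI-reserved characters.
    user = quote(env["PGUSER"], safe="")
    password = quote(env["PGPASSWORD"], safe="")
    database = quote(env["PGDATABASE"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Percent-encoding the password matters because characters such as `@` or `/` would otherwise be misread as URI separators, which is itself a frequent cause of a "wrong connection string".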

TRANSFORM CHAIN

BUILD

Build a three-layer transformation from raw JSON events in Postgres to a sessionized, aggregated daily mart in ClickHouse.

TRANSFORMATION · postgres · clickhouse
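
The sessionize-then-aggregate step can be illustrated in plain Python, independent of either database. This sketch assumes a session ends after 30 minutes of inactivity (`SESSION_GAP`), which is a common convention but not something the scenario text specifies. The function names and the `(user_id, timestamp)` event shape are also hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; the scenario's actual rule may differ.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events):
    """Assign a per-user session index to each event.

    events: iterable of (user_id, datetime) pairs in any order.
    Returns a list of (user_id, datetime, session_index), sorted by user
    then time. A new session starts when the gap to the user's previous
    event exceeds SESSION_GAP.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)
    out = []
    for user_id in sorted(by_user):
        session = 0
        previous = None
        for ts in sorted(by_user[user_id]):
            if previous is not None and ts - previous > SESSION_GAP:
                session += 1
            out.append((user_id, ts, session))
            previous = ts
    return out


def daily_session_counts(sessionized):
    """Aggregate to a daily mart: ISO date -> number of sessions starting that day."""
    first_seen = {}
    for user_id, ts, session in sessionized:
        key = (user_id, session)
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts
    counts = defaultdict(int)
    for start in first_seen.values():
        counts[start.date().isoformat()] += 1
    return dict(counts)
```

In the three-layer framing, the raw `(user_id, timestamp)` pairs play the role of the first layer, the output of `sessionize` the second, and `daily_session_counts` the mart.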

SCHEMA EVOLUTION

BUILD

Add a column to a synthetic dimension table, create the corresponding ClickHouse target, build a cross-database view, and verify old queries still work.

SCHEMA DESIGN · postgres · clickhouse

IDEMPOTENT PIPELINE

BUILD

Build a pipeline from Postgres to ClickHouse that produces identical results whether run once or three times. Handle updates to existing rows by primary key.

RELIABILITY · postgres · clickhouse
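
The idempotency requirement can be shown with an in-memory stand-in for the target table. This is a sketch of the idea only, under the assumption that each source row carries an `id` primary key and an `updated_at` version field; both field names and the `run_pipeline` function are hypothetical. In ClickHouse, one option for similar semantics is a `ReplacingMergeTree` table with a version column, though the scenario does not prescribe that.

```python
def run_pipeline(source_rows, target):
    """Merge source rows into target, keyed by primary key.

    target: dict mapping id -> row dict, standing in for the destination table.
    A source row replaces the stored row only when it is strictly newer,
    so re-running with unchanged input is a no-op.
    """
    for row in source_rows:
        current = target.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            target[row["id"]] = dict(row)  # copy so later source edits don't leak in
    return target


if __name__ == "__main__":
    source = [
        {"id": 1, "name": "widget", "updated_at": 1},
        {"id": 2, "name": "gadget", "updated_at": 1},
        {"id": 1, "name": "widget v2", "updated_at": 2},  # update to row 1
    ]
    once = run_pipeline(source, {})
    thrice = {}
    for _ in range(3):
        run_pipeline(source, thrice)
    print(once == thrice)  # same final state after one run or three
```

The key design choice is that the merge depends only on the row's key and version, never on how many times the pipeline has run, which is what makes append-only loads non-idempotent and keyed upserts idempotent.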

QUALITY GATE

FIX

Add data quality checks to a working pipeline. A chaos script injects nulls, duplicates, schema drift, and stale timestamps. The agent must detect all four problems without false positives.

DATA QUALITY · postgres · clickhouse
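
A minimal sketch of the four checks, run over a batch of rows held as Python dicts. Everything here is an assumption made for illustration: the `check_batch` function, its parameters, the issue labels, and the 24-hour staleness threshold are hypothetical, and the scenario's real pipeline and chaos script may define these problems differently.

```python
from datetime import datetime, timedelta


def check_batch(rows, expected_columns, key, ts_column, now,
                required=(), max_age=timedelta(hours=24)):
    """Return the set of issue labels found in a batch of row dicts.

    Labels: "nulls", "duplicates", "schema_drift", "stale_timestamps".
    An empty set means the batch looks clean under these assumed rules.
    """
    issues = set()
    expected = set(expected_columns)
    seen_keys = set()
    for row in rows:
        # Schema drift: the row's columns differ from the expected set.
        if set(row) != expected:
            issues.add("schema_drift")
        # Nulls: a required column is present but empty. Missing columns
        # are reported as drift only, to avoid double-flagging.
        if any(col in row and row[col] is None for col in required):
            issues.add("nulls")
        # Duplicates: the same key value appears more than once.
        value = row.get(key)
        if value is not None:
            if value in seen_keys:
                issues.add("duplicates")
            seen_keys.add(value)
        # Stale timestamps: older than the assumed freshness window.
        ts = row.get(ts_column)
        if ts is not None and now - ts > max_age:
            issues.add("stale_timestamps")
    return issues
```

Returning a set of labels rather than raising on the first problem lets a caller verify both directions: that every injected fault is reported, and that a clean batch yields an empty set.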

INGEST-TO-API

BUILD

Build a complete pipeline: ingest events from Postgres, transform and load into ClickHouse, and expose an HTTP API serving top products, conversion funnel, and hourly trends.

END-TO-END · postgres · clickhouse

HARNESSES

Each scenario runs against multiple harness configurations. The harness determines what tools the agent has access to.

BASE INFRASTRUCTURE

BASE RT

No extra tooling: ClickHouse, Redpanda, and Postgres, plus Python, Node.js, and database CLIs. This is the control group.

STANDARD TOOLKIT

CLASSIC DE

Standard data engineering toolkit. Airflow + Spark + dbt on top of the base infrastructure.

$ apt-get update && apt-get install -y --no-install-recommends openjdk-17-jre-headless && rm -rf /var/lib/apt/lists/* && pip3 install --no-cache-dir --break-system-packages dbt-core==1.10.19 dbt-postgres==1.10.0 dbt-clickhouse==1.10.0 apache-airflow==2.10.5 pyspark==3.5.5

CODE-FIRST FRAMEWORK

OLAP FOR SWE

SWE-first data engineering. MooseStack with typed schemas, automated migrations, built-in MCP.

$ npm install -g @514labs/moose-cli@0.6.424 && mkdir -p /opt/dec-bench/moose && cd /opt/dec-bench/moose && npm init -y && npm install @514labs/moose-lib@0.6.424

Running the same scenario across different harnesses directly measures whether tooling helps agents perform better.
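
One way to read such a comparison is as a paired difference: for each scenario run under both harnesses, subtract the baseline's result from the candidate's, then average. The sketch below assumes each run is summarized by the number of gates passed. The `paired_delta` function and the harness and scenario identifiers in the usage lines are hypothetical, and the numbers are made up purely to show the arithmetic.

```python
from collections import defaultdict


def paired_delta(runs, baseline, candidate):
    """Mean difference in gates passed (candidate minus baseline).

    runs: iterable of (scenario, harness, gates_passed) tuples.
    Only scenarios that have a result for both harnesses are compared.
    Returns None when no scenario has both.
    """
    by_scenario = defaultdict(dict)
    for scenario, harness, gates in runs:
        by_scenario[scenario][harness] = gates
    deltas = [
        results[candidate] - results[baseline]
        for results in by_scenario.values()
        if baseline in results and candidate in results
    ]
    return sum(deltas) / len(deltas) if deltas else None


if __name__ == "__main__":
    # Illustrative values only, not benchmark results.
    runs = [
        ("csv-ingest", "base-rt", 2),
        ("csv-ingest", "classic-de", 3),
        ("slow-queries", "base-rt", 4),
        ("slow-queries", "classic-de", 4),
        ("table-layout", "base-rt", 1),  # no classic-de run, so excluded
    ]
    print(paired_delta(runs, "base-rt", "classic-de"))
```

Pairing by scenario keeps scenario difficulty from confounding the comparison, which a plain per-harness average would not do when harnesses cover different scenario sets.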

PERSONA

NAIVE VS SAVVY

Test agents with varying levels of data engineering expertise. Measure adaptability across knowledge levels.

STRATEGY

PLAN VS EXECUTE

Does your agent think before acting? Compare strategic planners against direct executors.

DATA STACK

Real infrastructure, not mocks. Every scenario runs against a production-grade stack.

OLTP

POSTGRES

Transactional source of truth. Schema migrations, referential integrity, row-level operations.

STREAMING

REDPANDA

High-throughput event streaming. Topic management, consumer groups, exactly-once delivery.

OLAP

CLICKHOUSE

Columnar analytics engine. Materialized views, real-time aggregation, petabyte-scale queries.

COMING SOON: MySQL · DuckDB · Kafka · Snowflake · BigQuery · more
OPEN TO CONTRIBUTIONS
BENCHMARK YOUR AGENTS · CLIMB THE LEADERBOARD · OPEN SOURCE · CLICKHOUSE NATIVE · RUN LOCALLY · DOCKER POWERED

START YOUR EVAL

Run nine data engineering scenarios against your agents locally. Every eval is a single docker run command.