An open-source data engineering competency benchmark for evaluating an AI agent's ability to tackle real-world data problems.
Each gate is passed if all core and scenario assertions pass. Quantitative metrics then rank evals within each gate.
Nine real-world data engineering challenges across ingestion, schema design, query optimization, debugging, and end-to-end pipeline construction.
Load five messy CSV files into clean ClickHouse tables. Handle inconsistent dates, mixed null representations, duplicate headers, and trailing delimiters.
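A minimal sketch of the kind of normalization this scenario demands. The null tokens, date formats, and column names below are assumptions for illustration, not the benchmark's actual data:

```python
import csv
import io
from datetime import datetime

# Assumed null tokens and date formats; the real files vary.
NULL_TOKENS = {"", "NULL", "null", "N/A", "n/a", "-"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y")

def normalize_date(value):
    """Try several date formats, returning ISO-8601 or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_rows(text, date_cols=()):
    """Yield dicts with nulls unified and repeated header rows dropped."""
    reader = csv.reader(io.StringIO(text))
    # A trailing delimiter produces a phantom empty column: strip it.
    header = [h.strip() for h in next(reader) if h.strip()]
    for raw in reader:
        row = [c.strip() for c in raw[:len(header)]]
        if row == header:  # duplicate header embedded mid-file
            continue
        rec = {k: (None if v in NULL_TOKENS else v) for k, v in zip(header, row)}
        for col in date_cols:
            if rec.get(col) is not None:
                rec[col] = normalize_date(rec[col])
        yield rec
```

Cleaned rows can then be bulk-inserted into ClickHouse with whatever client the harness provides.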
Redesign a naive ClickHouse table with 5M rows to serve three representative query patterns. Set partition keys, order keys, and compression for target latencies.
Rewrite five slow analytical queries against a 10M-row ClickHouse table to run under latency thresholds without changing result sets.
Diagnose and fix a misconfigured Postgres setup: wrong connection string, partial init script, and hardcoded credentials in a Python script.
Build a three-layer transformation from raw JSON events in Postgres to a sessionized, aggregated daily mart in ClickHouse.
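The sessionization step can be sketched as gap-based grouping; the 30-minute inactivity timeout here is an assumption, not the scenario's specified value:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout

def sessionize(events):
    """Group (user_id, timestamp) events into per-user sessions.

    Events must be sorted by (user_id, timestamp). Returns a list of
    (user_id, session_start, session_end, event_count) tuples.
    """
    sessions = []
    current = None
    for user, ts in events:
        if current and current[0] == user and ts - current[2] <= SESSION_GAP:
            # Same user within the gap: extend the open session.
            current = (user, current[1], ts, current[3] + 1)
        else:
            if current:
                sessions.append(current)
            current = (user, ts, ts, 1)
    if current:
        sessions.append(current)
    return sessions
```

The daily mart layer then aggregates these sessions per day before loading into ClickHouse.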
Add a column to a synthetic dimension table, create the corresponding ClickHouse target, build a cross-database view, and verify old queries still work.
Build a pipeline from Postgres to ClickHouse that produces identical results whether run once or three times. Handle updates to existing rows by primary key.
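The idempotency requirement boils down to merge-by-key semantics. A minimal in-memory sketch of the property the repeated runs verify (row shape is hypothetical):

```python
def merge_by_key(target, batch, key="id"):
    """Upsert batch rows into target, keyed by primary key.

    Applying the same batch any number of times yields the same target,
    which is exactly what running the pipeline once vs. three times checks.
    """
    index = {row[key]: row for row in target}
    for row in batch:
        # New keys insert; existing keys are overwritten field by field.
        index[row[key]] = {**index.get(row[key], {}), **row}
    return sorted(index.values(), key=lambda r: r[key])
```

In ClickHouse this is typically achieved with ReplacingMergeTree or an explicit dedup step rather than in application memory.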
Add data quality checks to a working pipeline. A chaos script injects nulls, duplicates, schema drift, and stale timestamps. The agent must detect all four problems without false positives.
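The four injected failure classes could be checked along these lines; the column names, key, and 24-hour staleness threshold are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

def quality_report(rows, expected_cols, key="id", ts_col="updated_at",
                   max_age=timedelta(hours=24), now=None):
    """Flag which of the four injected problems are present."""
    now = now or datetime.now(timezone.utc)
    keys = [r.get(key) for r in rows]
    cols = set().union(*(r.keys() for r in rows)) if rows else set()
    newest = max((r[ts_col] for r in rows if r.get(ts_col)), default=None)
    return {
        "nulls": any(v is None for r in rows for v in r.values()),
        "duplicates": len(keys) != len(set(keys)),
        "schema_drift": cols != set(expected_cols),
        "stale": newest is None or now - newest > max_age,
    }
```

Avoiding false positives is the harder half: each check must stay silent on clean batches, not just fire on dirty ones.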
Build a complete pipeline: ingest events from Postgres, transform and load into ClickHouse, and expose an HTTP API serving top products, conversion funnel, and hourly trends.
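The conversion funnel endpoint could be backed by an aggregation like this; the stage names and event shape are assumed, not the benchmark's schema:

```python
FUNNEL = ["view", "add_to_cart", "purchase"]  # assumed stage order

def conversion_funnel(events):
    """Count users reaching each stage, requiring all prior stages.

    events: iterable of (user_id, event_type) pairs.
    """
    seen = {}  # user -> set of event types observed
    for user, etype in events:
        seen.setdefault(user, set()).add(etype)
    counts = []
    for i, stage in enumerate(FUNNEL):
        required = set(FUNNEL[: i + 1])
        counts.append(sum(1 for types in seen.values() if required <= types))
    return dict(zip(FUNNEL, counts))
```

In the scenario itself this aggregation would run in ClickHouse, with the HTTP API serving the precomputed result.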
Each scenario runs against multiple harness configurations. The harness determines what tools the agent has access to.
No extra tooling. ClickHouse, Redpanda, and Postgres, with Python, Node.js, and the database CLIs. The control group.
Standard data engineering toolkit. Airflow + Spark + dbt on top of the base infrastructure.
$ apt-get update && apt-get install -y --no-install-recommends openjdk-17-jre-headless && rm -rf /var/lib/apt/lists/* && pip3 install --no-cache-dir --break-system-packages dbt-core==1.10.19 dbt-postgres==1.10.0 dbt-clickhouse==1.10.0 apache-airflow==2.10.5 pyspark==3.5.5

SWE-first data engineering. MooseStack with typed schemas, automated migrations, built-in MCP.
$ npm install -g @514labs/moose-cli@0.6.424 && mkdir -p /opt/dec-bench/moose && cd /opt/dec-bench/moose && npm init -y && npm install @514labs/moose-lib@0.6.424

The same scenario across different harnesses directly measures whether tooling helps agents perform better.
Test agents with varying levels of data engineering expertise. Measure adaptability across knowledge levels.
Does your agent think before acting? Compare strategic planners against direct executors.
Real infrastructure, not mocks. Every scenario runs against a production-grade stack.
Transactional source of truth. Schema migrations, referential integrity, row-level operations.
High-throughput event streaming. Topic management, consumer groups, exactly-once delivery.
Columnar analytics engine. Materialized views, real-time aggregation, petabyte-scale queries.
Run all nine data engineering scenarios against your agents locally. Every eval is a single docker run command.