# Getting Started
Create and contribute a full DEC Bench eval from scenario design to JSON output.
This guide walks you through the full eval authoring flow: design a scenario, define its starting state, write assertions for all five gates (how the agent's performance is measured), select a harness, register the eval, and run it locally.
## What You Will Build
You will create a self-contained data engineering eval that runs inside DEC Bench and emits deterministic scores as the eval's output.
Scenarios can start from either of two starting-state models:
- Broken/incomplete start: infrastructure boots with defects the agent must diagnose and fix.
- Clean/greenfield start: infrastructure is healthy and the agent must build the requested solution from scratch.
## Prerequisites
- Node.js >= 20
- pnpm >= 10.4
- Docker
- Local clone of the repository

Install dependencies:

```sh
pnpm install
```

Recommended context before authoring: the Foo Bar, Competencies, and Difficulty Tiers pages referenced in the steps below.
## Bootstrap a Scenario
The fastest way to start is with the CLI scaffold command. It generates all required files and directories so you can focus on writing the eval.
### Flag-based (single command)
Pass all options as flags to skip prompts entirely:
```sh
dec-bench create \
  --name foo-bar-my-scenario \
  --domain foo-bar \
  --tier tier-1
```

### Available flags
| Flag | Required | Default | Description |
|---|---|---|---|
| `--name` / `-n` | yes | — | Scenario ID, used as the directory name and JSON `id` |
| `--domain` / `-d` | yes | — | Business domain (`foo-bar` for v0.1) |
| `--tier` / `-t` | no | `tier-1` | Difficulty tier (`tier-1`, `tier-2`, `tier-3`) |
| `--harness` | no | `base-rt` | Evaluation harness profile |
| `--dir` | no | `scenarios` | Root directory for scenario output |
### What gets generated
Every file is pre-populated with the right structure and placeholder comments. The `scenario.json` is pre-filled with your `--domain`, `--tier`, and `--harness` values.
## Authoring flow
Once scaffolded, work through each file in order:
### 1) Design the Problem
Pick the eval target before you write any files:
- Domain: for v0.1, use `foo-bar`. See Foo Bar for context.
- Competency focus: choose primary skill(s) from Competencies.
- Tier: set difficulty -- see Difficulty Tiers for guidance on scoping.
- Starting-state model:
- broken/incomplete (fix and recover)
- clean/greenfield (build and deliver)
Define success in concrete terms. Good evals produce pass/fail outcomes, not subjective judgments.
### 2) Write the Prompts
Each scenario has a `prompts/` directory with two files -- one per persona. Each file is passed directly to the agent as its task input.

- `naive.md` -- a user who knows what they want but not how to get there. Plain language, no tool names, no implementation hints.
- `savvy.md` -- an experienced engineer who names tools, specifies targets, and sets technical constraints.
Both prompts ask for the same outcome and are scored against the same assertions. The persona changes how much the user helps the agent, not what success looks like.
`naive.md`:

```
I have a bunch of CSV files that need to go into ClickHouse.
Some of them have messy data — weird dates, missing values.
Can you get it all loaded into a clean table?
```

`savvy.md`:

```
Ingest five CSV files from /data/csv/ into a single ClickHouse
table `analytics.events`. Handle DD/MM/YYYY dates, mixed null
representations (N/A, null, empty), duplicate header rows
mid-file, and trailing delimiters. Target schema: event_id
String, event_ts DateTime, user_id String, event_type String,
value Float64 (nullable values should be 0).
```

### 3) Set Up Infrastructure and Seed Data
All services run inside a single container via `supervisord`. Define which services to start in `supervisord.conf` and use `init/` scripts to set up their initial state.
```ini
[program:postgres]
command=/usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/data
autostart=true
autorestart=false

[program:clickhouse]
command=/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
autostart=true
autorestart=false
```

Connection strings are exported as environment variables (`$POSTGRES_URL`, `$CLICKHOUSE_URL`, etc.) so the agent and assertions can use them directly.
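Assertion or setup code can consume those exported connection strings directly. A minimal sketch using Node's standard `URL` parsing (the fallback URL here is an illustrative default, not a value the base image guarantees):

```typescript
// Sketch: resolve a connection string from the environment, with an
// explicit fallback so a malformed or missing value fails loudly and
// early rather than deep inside a database client.
export function connectionConfig(name: string, fallback: string): URL {
  const raw = process.env[name] ?? fallback;
  return new URL(raw); // throws immediately if the string is not a valid URL
}

// Example: derive ClickHouse host/port from $CLICKHOUSE_URL.
const ch = connectionConfig("CLICKHOUSE_URL", "http://localhost:8123");
console.log(ch.hostname, ch.port);
```

Parsing once at startup keeps the rest of the assertion code free of string handling; pass the resulting `URL` (or its fields) into whatever client constructor your harness provides.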
#### Broken/Incomplete Start
Use init scripts to intentionally introduce defects:
- misconfigured service connection
- missing index or partitioning
- schema drift between source and model
- permission mismatch or dependency gap
#### Clean/Greenfield Start
Use init scripts to provide healthy infrastructure and realistic source data:
- base schemas and source tables
- event streams or append-only source logs
- clear boundary of what the agent must create
In both models:
- seed deterministic data
- export connection settings via environment variables
- avoid hidden state that makes runs non-reproducible
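To make the "seed deterministic data" point concrete: a fixed-seed PRNG guarantees byte-identical seed files on every run, so assertion counts stay stable. A minimal sketch (the column layout and messiness mix are hypothetical, loosely echoing the CSV example above):

```typescript
// Sketch: deterministic seed-data generation. mulberry32 is a small,
// well-known seeded PRNG; with a fixed seed the generated CSV content
// is identical on every run, keeping the eval reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

export function seedCsv(rows: number, seed = 42): string {
  const rand = mulberry32(seed);
  const lines = ["event_id,event_ts,user_id,event_type,value"];
  for (let i = 0; i < rows; i++) {
    // Inject edge cases (N/A values, DD/MM/YYYY dates) at positions
    // that are messy but deterministic.
    const value = rand() < 0.2 ? "N/A" : (rand() * 100).toFixed(2);
    lines.push(`e${i},25/12/2024 10:0${i % 10},u${i % 3},click,${value}`);
  }
  return lines.join("\n") + "\n";
}
```

Write the output to the seed location your scenario uses during image init; because the seed is fixed, repeated runs see exactly the same source data.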
### 4) Write Gate Assertions
Each gate has a TypeScript file that exports named assertion functions. The gate assertion framework in the base image discovers and runs them, collecting results into the structured JSON output described in Scoring.
| Gate | Goal | Typical checks |
|---|---|---|
| Functional | It runs | process exits cleanly, expected artifacts exist |
| Correct | It is right | row counts, checksums, schema/type expectations |
| Robust | It handles edge cases | nulls/dupes/late events, idempotent reruns |
| Performant | It is fast enough | query latency, pipeline runtime thresholds |
| Production | You would ship it | env-var usage, tests present, secret hygiene |
Each assertion is a named async function that returns an `AssertionResult` object with `passed`, plus optional `message` and `details` fields. The framework provides a context object with database clients and environment access, and the `message`/`details` are surfaced in the audit panel's per-assertion log tab.
```ts
import type { AssertionContext, AssertionResult } from "@dec-bench/eval-core";

async function queryRows<T>(ctx: AssertionContext, sql: string): Promise<T[]> {
  const result = await ctx.clickhouse.query({ query: sql, format: "JSONEachRow" });
  return (await (result as any).json()) as T[];
}

export async function target_table_exists(ctx: AssertionContext): Promise<AssertionResult> {
  const rows = await queryRows<{ n: number }>(
    ctx,
    "SELECT count() AS n FROM system.tables WHERE database = 'analytics' AND name = 'events'",
  );
  const count = Number(rows[0]?.n ?? 0);
  const passed = count === 1;
  return {
    passed,
    message: passed ? "Target table exists." : `Expected 1 table, got ${count}.`,
    details: { expected: 1, actual: count },
  };
}

export async function table_has_rows(ctx: AssertionContext): Promise<AssertionResult> {
  const rows = await queryRows<{ n: number }>(ctx, "SELECT count() AS n FROM analytics.events");
  const count = Number(rows[0]?.n ?? 0);
  const passed = count > 0;
  return {
    passed,
    message: passed ? "Table has rows." : "Table is empty.",
    details: { rowCount: count },
  };
}
```

```ts
import type { AssertionContext, AssertionResult } from "@dec-bench/eval-core";

async function queryRows<T>(ctx: AssertionContext, sql: string): Promise<T[]> {
  const result = await ctx.clickhouse.query({ query: sql, format: "JSONEachRow" });
  return (await (result as any).json()) as T[];
}

export async function all_fifteen_events_loaded(ctx: AssertionContext): Promise<AssertionResult> {
  const rows = await queryRows<{ n: number }>(ctx, "SELECT count() AS n FROM analytics.events");
  const count = Number(rows[0]?.n ?? 0);
  const passed = count === 15;
  return {
    passed,
    message: passed ? "All 15 events loaded." : `Expected 15, got ${count}.`,
    details: { expected: 15, actual: count },
  };
}
```

Keep assertions deterministic, fast, and focused on a single check each. The framework maps each exported function name directly to the assertion keys in the scoring output and assertion-log sidecar.
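The same pattern extends to the other gates. Below is a sketch of a Robust-gate duplicate check; the `@dec-bench/eval-core` types are reduced to minimal structural stand-ins so the sketch runs standalone, and the query helper is abstracted behind the context so it can be stubbed:

```typescript
// Minimal structural stand-ins for the real AssertionContext /
// AssertionResult types, so this example is self-contained.
type AssertionResult = { passed: boolean; message?: string; details?: unknown };
type Ctx = { queryRows: <T>(sql: string) => Promise<T[]> };

// Robust gate: re-running the pipeline must not duplicate events.
// uniqExact is ClickHouse's exact distinct-count aggregate.
export async function no_duplicate_event_ids(ctx: Ctx): Promise<AssertionResult> {
  const rows = await ctx.queryRows<{ n: number }>(
    "SELECT count() - uniqExact(event_id) AS n FROM analytics.events",
  );
  const dupes = Number(rows[0]?.n ?? 0);
  const passed = dupes === 0;
  return {
    passed,
    message: passed ? "No duplicate event_ids." : `${dupes} duplicate event_ids found.`,
    details: { duplicates: dupes },
  };
}
```

Because the database access goes through the context, the assertion itself stays a pure function of query results, which keeps it deterministic and easy to exercise outside the harness.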
### 5) Select or Add a Harness
A harness installs tools in the image layer used by the agent.
Use an existing harness profile when possible. v0.1 ships with three:
| Harness | What it installs |
|---|---|
| `base-rt` | Nothing -- base infrastructure (ClickHouse, Redpanda, Postgres) with Python, Node.js, database CLIs |
| `classic-de` | dbt-core, dbt-postgres, dbt-clickhouse, apache-airflow, pyspark |
| `olap-for-swe` | MooseStack (moose-cli, moose-lib) |
Create a custom harness only if tool requirements are truly new:

```sh
pip3 install --no-cache-dir my-special-tool==1.0.0
```

### 6) Register the Scenario
Create scenario metadata that maps cleanly to DEC Bench scenario types:
```json
{
  "id": "foo-bar-csv-ingest",
  "title": "Foo Bar CSV Ingest",
  "description": "Load five messy CSV files into clean ClickHouse tables.",
  "tier": "tier-1",
  "domain": "foo-bar",
  "harness": "base-rt",
  "tasks": [
    {
      "id": "ingest-csvs",
      "description": "Create a ClickHouse table and load all five CSV files.",
      "category": "ingestion"
    }
  ],
  "personaPrompts": {
    "naive": "prompts/naive.md",
    "savvy": "prompts/savvy.md"
  },
  "tags": ["csv", "ingestion", "data-cleaning", "type-coercion"],
  "baselineMetrics": {
    "queryLatencyMs": 0,
    "storageBytes": 0,
    "costPerQueryUsd": 0
  },
  "referenceMetrics": {
    "queryLatencyMs": 50,
    "storageBytes": 5000000,
    "costPerQueryUsd": 0.001
  }
}
```

Use `packages/scenarios/src/types.ts` as the schema contract for required fields and allowed enum values.
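As a rough illustration of what that contract enforces, a pre-flight check on the metadata might look like the sketch below. The field and enum lists merely mirror this page's examples and may lag the source; `packages/scenarios/src/types.ts` remains the single source of truth:

```typescript
// Illustrative metadata pre-flight check. Field names and enum values
// are copied from this guide's examples; always defer to
// packages/scenarios/src/types.ts for the authoritative schema.
const TIERS = ["tier-1", "tier-2", "tier-3"];
const HARNESSES = ["base-rt", "classic-de", "olap-for-swe"];
const REQUIRED = [
  "id", "title", "description", "tier",
  "domain", "harness", "tasks", "personaPrompts",
];

export function metadataProblems(meta: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED) {
    if (meta[field] === undefined) problems.push(`missing field: ${field}`);
  }
  const tier = meta["tier"];
  if (typeof tier === "string" && !TIERS.includes(tier)) {
    problems.push(`invalid tier: ${tier}`);
  }
  const harness = meta["harness"];
  if (typeof harness === "string" && !HARNESSES.includes(harness)) {
    problems.push(`invalid harness: ${harness}`);
  }
  return problems;
}
```

Running a check like this before opening a PR catches the most common registration mistakes (typoed tier or harness values, missing prompt mappings) without waiting on CI.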
### 7) Run and Validate Locally
Run:

```sh
dec-bench run \
  --scenario foo-bar-csv-ingest \
  --harness base-rt \
  --persona naive \
  --mode no-plan
```

Verify output:
- JSON emits to stdout
- gate-level pass/fail and scores are present
- failures are actionable from assertion output
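Those three checks can be scripted. The sketch below assumes a top-level `gates` key containing one entry per gate; that shape is an assumption for illustration, so match it against the actual output described in Scoring before relying on it:

```typescript
// Sketch: sanity-check the JSON emitted on stdout. The gate names
// come from the table in step 4; the `gates` key layout is an
// ASSUMED shape, not a documented contract.
const GATES = ["functional", "correct", "robust", "performant", "production"];

export function checkRunOutput(stdout: string): string[] {
  let parsed: any;
  try {
    parsed = JSON.parse(stdout);
  } catch {
    return ["stdout is not valid JSON"];
  }
  const problems: string[] = [];
  for (const gate of GATES) {
    if (parsed.gates?.[gate] === undefined) {
      problems.push(`missing gate result: ${gate}`);
    }
  }
  return problems;
}
```

An empty returned array means the output parsed cleanly and every gate reported a result; anything else points at what to fix before the next run.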
### 8) PR Readiness Checklist
- both persona prompts (`naive.md`, `savvy.md`) are specific, scoped, and testable
- all five gate assertion `.ts` files exist and export valid assertion functions
- deterministic seed data is included
- no hardcoded secrets or credentials
- scenario metadata matches allowed `tier`, `domain`, and `harness` values
- local run output is stable across repeated runs
Validation commands:
```sh
pnpm --filter @dec-bench/scenarios check-types
pnpm --filter @dec-bench/scenarios lint
pnpm lint
```