Add a Scenario
Design a data engineering task, define how it is measured, and publish it to the benchmark.
This guide walks through creating and testing a new scenario locally, then publishing it to DEC Bench via pull request.
Each scenario is a self-contained data engineering task that runs across all agents, harnesses, and personas. To create one, you define:
- The scenario itself: infrastructure, seed data, and the problem statement
- The prompts used by each persona
- The assertions that prove gate achievement
Before creating a new scenario, check whether a similar one already exists. The scenarios directory contains all current scenarios. Use an LLM to search through them if you are unsure.
Prerequisites
- Docker installed and running
- Node.js >= 20 and pnpm >= 10.4
- The `dec-bench` CLI: `curl -fsSL https://decbench.ai/install.sh | sh` (see Quickstart)
- A local clone of the repo: `git clone https://github.com/514-labs/agent-evals.git && cd agent-evals && pnpm install`
- DEC Bench agent skills (run from the repo root):

  ```sh
  npx skills add . --skill dec-bench-quickstart -a claude-code -a cursor -a codex
  npx skills add . --skill dec-bench-run -a claude-code -a cursor -a codex
  npx skills add . --skill dec-bench-create-scenario -a claude-code -a cursor -a codex
  ```
With the skills installed, open your agent in the repo directory and it can walk you through the entire process:
> Create a tier-1 scenario that tests ClickHouse query optimization on the foo-bar domain. The agent should find and fix slow analytical queries on a table with a naive ordering key.
Your agent will gather any missing context, scaffold the scenario with dec-bench create, and guide you through filling in prompts, assertions, and seed data. The steps below describe the same process manually.
1. Design the problem
Before scaffolding, decide:
- Domain: for v0.x, use `foo-bar`
- Tier: `tier-1` (single skill), `tier-2` (multi-step), or `tier-3` (cross-system). See the tier definitions on the homepage.
- Starting state:
  - Broken/incomplete: infrastructure boots with defects the agent must diagnose and fix
  - Clean/greenfield: infrastructure is healthy and the agent builds from scratch
Define success in concrete, pass/fail terms. Good evals produce deterministic outcomes, not subjective judgments: "the table `analytics.events` exists and contains exactly five rows" is checkable, while "the pipeline is well-designed" is not.
2. Scaffold the scenario
2a. Run the scaffold
The CLI generates all required files and directories.
```sh
dec-bench create \
  --name <my-scenario> \   # required: scenario ID, used as directory name
  --domain <my-domain> \   # required: business domain (foo-bar for v0.1)
  --tier tier-1 \          # difficulty tier: tier-1, tier-2, tier-3 (see homepage)
  --harnesses base-rt      # comma-separated: base-rt, classic-de, olap-for-swe
```

This generates:
| Path | Purpose |
|---|---|
| `assertions/` | Pass/fail checks per quality gate — see step 5 |
| `harnesses/<harness-id>/prompts/` | What the agent is told to do (one per persona), scoped to this harness — see step 3 |
| `init/` | SQL and scripts to seed data before the agent starts — see step 4 |
| `scenario.json` | Scenario metadata — see 2b below |
| `supervisord.conf` | Which services start in the container — see step 4 |
One harnesses/<harness-id>/ subdirectory is scaffolded per harness you pass to --harnesses. The harness-scenario pair owns its prompts (and optionally harness-specific init scripts and install steps).
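For a scenario scaffolded with `--harnesses base-rt`, the resulting layout might look like this (illustrative sketch; file names such as `01-seed.sql` are examples, not scaffold output):

```
<my-scenario>/
├── scenario.json
├── supervisord.conf
├── init/
│   └── 01-seed.sql
├── assertions/
└── harnesses/
    └── base-rt/
        ├── prompts/
        │   ├── baseline.md
        │   └── informed.md
        ├── init/          (optional, harness-specific)
        └── install.sh     (optional)
```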
Object model
A scenario is a matrix: 1 scenario × N harnesses × 2 personas = 2N evaluations. The scenario root holds what's shared across all harnesses (scenario.json, supervisord.conf, init/, assertions/). Each harnesses/<harness-id>/ directory is the unit of ownership for its (scenario, harness) pair: it always has prompts/baseline.md + prompts/informed.md, and may have its own init/ and install.sh.
Why prompts and init can differ per harness. Prompts diverge to control whether the agent reaches for a specific tool: the baseline tests whether the agent picks it unprompted, while an informed prompt that names the tool tests how well it uses the tool when told to. Different harnesses ship different tools, so informed prompts usually name different tools per harness. Init diverges when harnesses need different starting infrastructure (e.g. a scaffolded Moose project for olap-for-swe vs a scaffolded dbt project for classic-de) so each harness boots into the state its tools expect.
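The matrix arithmetic is simple to sketch. With the three harnesses used in this guide's examples, one scenario expands to six evaluations (illustrative code, not the framework's API):

```typescript
// Enumerate the evaluation matrix for one scenario: every harness
// runs with both personas. IDs mirror this guide's examples.
const harnesses = ["base-rt", "classic-de", "olap-for-swe"];
const personas = ["baseline", "informed"];

const evaluations = harnesses.flatMap((harness) =>
  personas.map((persona) => ({ harness, persona }))
);

console.log(evaluations.length); // 1 scenario × 3 harnesses × 2 personas = 6
```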
2b. Complete scenario metadata
The scaffold command creates scenario.json and pre-fills id, domain, tier, and harnesses from your flags. The CLI reads this file to build the Docker image, select which harnesses and personas to run, and display results. Open it and fill in title, description, tasks, and tags:
```diff
 {
   "id": "foo-bar-csv-ingest",
-  "title": "",
-  "description": "",
+  "title": "Foo Bar CSV Ingest",
+  "description": "Load five messy CSV files into clean ClickHouse tables.",
   "tier": "tier-1",
   "domain": "foo-bar",
   "harnesses": ["base-rt", "classic-de", "olap-for-swe"],
-  "tasks": [],
+  "tasks": [
+    {
+      "id": "ingest-csvs",
+      "description": "Create a ClickHouse table and load all five CSV files.",
+      "category": "ingestion"
+    }
+  ],
-  "tags": []
+  "tags": ["csv", "ingestion", "data-cleaning", "type-coercion"]
 }
```

3. Write the persona prompts
Each harness-scenario pair tests two prompts, one per persona. Prompts live under harnesses/<harness-id>/prompts/ — this lets the same scenario ship different prompts per harness when needed (e.g. an informed prompt that names tools specific to olap-for-swe).
- `harnesses/<harness-id>/prompts/baseline.md`: plain language, no tool names or implementation hints. Tests what the agent figures out on its own.
- `harnesses/<harness-id>/prompts/informed.md`: names specific tools, sets explicit targets, provides technical constraints. Tests whether domain knowledge changes the outcome.
Examples:
Baseline:

```
I have a bunch of CSV files that need to go into ClickHouse.
Some of them have messy data -- weird dates, missing values.
Can you get it all loaded into a clean table?
```

Informed:

```
Ingest five CSV files from /data/csv/ into a single ClickHouse
table `analytics.events`. Handle DD/MM/YYYY dates, mixed null
representations (N/A, null, empty), duplicate header rows
mid-file, and trailing delimiters. Target schema: event_id
String, event_ts DateTime, user_id String, event_type String,
value Float64 (nullable values should be 0).
```

Both prompts should target the same outcome, and the baseline should be solvable.
For more details, see Adding Persona Prompts.
4. Set up infrastructure and seed data
Services run inside the scenario container via supervisord.conf. Init scripts in init/ set up their initial state.
```ini
[program:postgres]
command=/usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/data
autostart=true
autorestart=false

[program:clickhouse]
command=/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
autostart=true
autorestart=false
```

Connection strings are exported as environment variables (`$POSTGRES_URL`, `$CLICKHOUSE_URL`, etc.) so the agent and assertions can use them directly.
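Scripts and assertions can fail fast with a clear message when a required variable is missing. A minimal sketch (`requireEnv` is a hypothetical helper for illustration, not part of the DEC Bench API):

```typescript
// Read a required connection string, failing loudly if the scenario
// container has not exported it. requireEnv is an illustrative helper,
// not part of the framework.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`${name} is not set; are the scenario services running?`);
  }
  return value;
}

// Usage: const clickhouseUrl = requireEnv("CLICKHOUSE_URL");
```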
Seed data using init scripts in init/. These run after services are ready but before the agent starts. They can be SQL files, shell scripts, or anything executable.
```sql
CREATE SCHEMA IF NOT EXISTS app;

CREATE TABLE app.transactions (
  id SERIAL PRIMARY KEY,
  customer_id TEXT NOT NULL,
  amount NUMERIC(10, 2) NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

INSERT INTO app.transactions (customer_id, amount, created_at) VALUES
  ('c001', 10.00, '2026-01-15T09:00:00Z'),
  ('c002', 25.50, '2026-01-15T09:05:00Z'),
  ('c003', 33.33, '2026-01-15T09:10:00Z'),
  ('c004', 47.99, '2026-01-15T09:15:00Z'),
  ('c005', 52.00, '2026-01-15T09:20:00Z');
```

For broken/incomplete starts, use init scripts to introduce defects (misconfigured connections, missing indexes, schema drift). For clean/greenfield starts, provide healthy infrastructure and realistic source data.
Seed deterministic data, export connection settings via environment variables, and avoid hidden state that makes runs non-reproducible.
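When generating larger seed sets programmatically, a fixed-seed PRNG keeps every image build byte-identical. A sketch under that assumption (mulberry32 is one common tiny seedable generator; any seedable PRNG works):

```typescript
// Deterministic row generation: same seed, same rows, every build.
// mulberry32 is a small public-domain seedable PRNG.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42); // fixed seed: reruns are reproducible
const rows = Array.from({ length: 5 }, (_, i) => ({
  customer_id: `c${String(i + 1).padStart(3, "0")}`,
  amount: Math.round(rand() * 10000) / 100, // two decimal places
}));
```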
Tools (CLIs, frameworks like Moose or dbt) belong in the harness installScript — not in init/*.sh — so they're baked into the image once instead of reinstalled every run. See Creating a Custom Harness. For comparison scenarios where starting state differs per harness, put harness-specific seed scripts in harnesses/<harness-id>/init/ — they run only when that harness is active.
For more details, see Adding Multiple Services.
5. Write gate assertions
Each scenario has five assertion files in assertions/, one per quality gate. The framework includes core assertions that run for every scenario (e.g. process_exits_clean, no_unhandled_errors). Your files add scenario-specific checks on top of these.
Each assertion returns an AssertionResult with passed, plus optional message and details fields. See assertions in existing scenarios for reference.
```ts
import type { AssertionContext, AssertionResult } from "@dec-bench/eval-core";

export async function target_table_exists(ctx: AssertionContext): Promise<AssertionResult> {
  const result = await ctx.clickhouse.query({
    query: "SELECT count() AS n FROM system.tables WHERE database = 'analytics' AND name = 'events'",
    format: "JSONEachRow",
  });
  const rows = await (result as any).json();
  const count = Number(rows[0]?.n ?? 0);
  return {
    passed: count === 1,
    message: count === 1 ? "Target table exists." : `Expected 1 table, got ${count}.`,
    details: { expected: 1, actual: count },
  };
}
```

One check per function, actionable failure messages, deterministic results.
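Row-count gates follow the same shape, and because the context is a plain object, assertions can be smoke-tested with a stubbed `ctx` before a full run. A second sketch (the types are inlined here so the snippet is self-contained; in a real scenario they come from `@dec-bench/eval-core`, and the expected count of 5 matches the seed data above):

```typescript
// Inlined stand-ins for the framework types, so this sketch runs standalone.
type AssertionResult = { passed: boolean; message?: string; details?: unknown };
type AssertionContext = {
  clickhouse: {
    query(opts: { query: string; format: string }): Promise<{ json(): Promise<any[]> }>;
  };
};

// Hypothetical companion check: were all five seeded rows ingested?
export async function all_rows_ingested(ctx: AssertionContext): Promise<AssertionResult> {
  const result = await ctx.clickhouse.query({
    query: "SELECT count() AS n FROM analytics.events",
    format: "JSONEachRow",
  });
  const rows = await result.json();
  const count = Number(rows[0]?.n ?? 0);
  return {
    passed: count === 5,
    message: count === 5 ? "All rows ingested." : `Expected 5 rows, got ${count}.`,
    details: { expected: 5, actual: count },
  };
}
```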
For more details, see Writing Assertions.
6. Run and validate locally
Test with a single agent/harness/persona combination. The benchmark will run all combinations once the scenario is published.
```sh
dec-bench build --scenario <my-scenario>
dec-bench run --scenario <my-scenario>
```

Verify the output:
- Gate-level pass/fail and scores are present
- Failures are actionable from assertion messages
- Results are stable across repeated runs
Publishing to DEC Bench
At this point your scenario runs locally — you can build, run, and iterate on it independently. If you'd like it included in the DEC Bench reference scenario suite, open a pull request to the repository. First-time contributors will be asked to sign a Contributor License Agreement (CLA) — the bot will prompt you on your PR. Your PR should:
- Pass validation checks: `pnpm --filter @dec-bench/scenarios check-types && pnpm lint`
- Include deterministic seed data (no external dependencies or hidden state)
- Have prompts that are specific, scoped, and testable
- Include all five gate assertion files with exported functions
- Contain no hardcoded secrets or credentials