Add a Scenario

Design a data engineering task, define how it is measured, and publish it to the benchmark.

This guide walks through creating and testing a new scenario locally, then publishing it to DEC Bench via pull request.

Each scenario is a self-contained data engineering task that runs across all agents, harnesses, and personas. To create one, you define:

  1. The scenario itself: infrastructure, seed data, and the problem statement
  2. The prompts used by each persona
  3. The assertions that verify each quality gate is met

Before creating a new scenario, check whether a similar one already exists. The scenarios directory contains all current scenarios. Use an LLM to search through them if you are unsure.

Prerequisites

  • Docker installed and running
  • Node.js >= 20 and pnpm >= 10.4
  • The dec-bench CLI: curl -fsSL https://decbench.ai/install.sh | sh (see Quickstart)
  • A local clone of the repo: git clone https://github.com/514-labs/agent-evals.git && cd agent-evals && pnpm install
  • DEC Bench agent skills (run from the repo root):
    • npx skills add . --skill dec-bench-quickstart -a claude-code -a cursor -a codex
    • npx skills add . --skill dec-bench-run -a claude-code -a cursor -a codex
    • npx skills add . --skill dec-bench-create-scenario -a claude-code -a cursor -a codex

With the skills installed, open your agent in the repo directory and it can walk you through the entire process:

Create a tier-1 scenario that tests ClickHouse query optimization on the foo-bar domain. The agent should find and fix slow analytical queries on a table with a naive ordering key.

Your agent will gather any missing context, scaffold the scenario with dec-bench create, and guide you through filling in prompts, assertions, and seed data. The steps below describe the same process manually.

1. Design the problem

Before scaffolding, decide:

  • Domain: for v0.x, use foo-bar
  • Tier: tier-1 (single skill), tier-2 (multi-step), or tier-3 (cross-system). See the tier definitions on the homepage.
  • Starting state:
    • Broken/incomplete: infrastructure boots with defects the agent must diagnose and fix
    • Clean/greenfield: infrastructure is healthy and the agent builds from scratch

Define success in concrete, pass/fail terms. Good evals produce deterministic outcomes, not subjective judgments.

2. Scaffold the scenario

2a. Run the scaffold

The CLI generates all required files and directories.

Scaffold a new scenario
dec-bench create \
  --name <my-scenario> \        # required — scenario ID, used as directory name
  --domain <my-domain> \        # required — business domain (foo-bar for v0.1)
  --tier tier-1 \               # difficulty tier: tier-1, tier-2, tier-3 (see homepage)
  --harnesses base-rt           # comma-separated: base-rt, classic-de, olap-for-swe

This generates:

  • assertions/ — Pass/fail checks per quality gate (see step 5)
  • harnesses/<harness-id>/prompts/ — What the agent is told to do (one per persona), scoped to this harness (see step 3)
  • init/ — SQL and scripts to seed data before the agent starts (see step 4)
  • scenario.json — Scenario metadata (see 2b below)
  • supervisord.conf — Which services start in the container (see step 4)

One harnesses/<harness-id>/ subdirectory is scaffolded per harness you pass to --harnesses. The harness-scenario pair owns its prompts (and optionally harness-specific init scripts and install steps).

Object model

A scenario is a matrix: 1 scenario × N harnesses × 2 personas = 2N evaluations. The scenario root holds what's shared across all harnesses (scenario.json, supervisord.conf, init/, assertions/). Each harnesses/<harness-id>/ directory is the unit of ownership for its (scenario, harness) pair: it always has prompts/baseline.md + prompts/informed.md, and may have its own init/ and install.sh.

Why prompts and init can differ per harness. Prompts diverge to control whether the agent reaches for a specific tool: the baseline tests whether the agent picks it unprompted, while an informed prompt that names the tool tests how well it uses the tool when told to. Different harnesses ship different tools, so informed prompts usually name different tools per harness. Init diverges when harnesses need different starting infrastructure (e.g. a scaffolded Moose project for olap-for-swe vs a scaffolded dbt project for classic-de) so each harness boots into the state its tools expect.
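
Concretely, a scenario scaffolded for two harnesses might look like the sketch below; which of the optional per-harness files exist depends on the scenario.

Example layout (two harnesses)
<my-scenario>/
├── scenario.json                 # shared metadata (2b)
├── supervisord.conf              # shared services (step 4)
├── init/                         # shared seed data (step 4)
├── assertions/                   # gate assertions (step 5)
└── harnesses/
    ├── base-rt/
    │   └── prompts/
    │       ├── baseline.md
    │       └── informed.md
    └── olap-for-swe/
        ├── prompts/
        │   ├── baseline.md
        │   └── informed.md
        ├── init/                 # optional: harness-specific starting state
        └── install.sh            # optional: harness-specific install step

With two harnesses, this scenario yields 2 × 2 = 4 evaluations.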

2b. Complete scenario metadata

The scaffold command creates scenario.json and pre-fills id, domain, tier, and harnesses from your flags. The CLI reads this file to build the Docker image, select which harnesses and personas to run, and display results. Open it and fill in title, description, tasks, and tags:

scenario.json
 {
   "id": "foo-bar-csv-ingest",
-  "title": "",
-  "description": "",
+  "title": "Foo Bar CSV Ingest",
+  "description": "Load five messy CSV files into clean ClickHouse tables.",
   "tier": "tier-1",
   "domain": "foo-bar",
   "harnesses": ["base-rt", "classic-de", "olap-for-swe"],
-  "tasks": [],
+  "tasks": [
+    {
+      "id": "ingest-csvs",
+      "description": "Create a ClickHouse table and load all five CSV files.",
+      "category": "ingestion"
+    }
+  ],
-  "tags": []
+  "tags": ["csv", "ingestion", "data-cleaning", "type-coercion"]
 }

3. Write the persona prompts

Each harness-scenario pair tests two prompts, one per persona. Prompts live under harnesses/<harness-id>/prompts/ — this lets the same scenario ship different prompts per harness when needed (e.g. an informed prompt that names tools specific to olap-for-swe).

  • harnesses/<harness-id>/prompts/baseline.md: plain language, no tool names or implementation hints. Tests what the agent figures out on its own.
  • harnesses/<harness-id>/prompts/informed.md: names specific tools, sets explicit targets, provides technical constraints. Tests whether domain knowledge changes the outcome.

Examples:

harnesses/base-rt/prompts/baseline.md
I have a bunch of CSV files that need to go into ClickHouse.
Some of them have messy data -- weird dates, missing values.
Can you get it all loaded into a clean table?

harnesses/base-rt/prompts/informed.md
Ingest five CSV files from /data/csv/ into a single ClickHouse
table `analytics.events`. Handle DD/MM/YYYY dates, mixed null
representations (N/A, null, empty), duplicate header rows
mid-file, and trailing delimiters. Target schema: event_id
String, event_ts DateTime, user_id String, event_type String,
value Float64 (nullable values should be 0).

Both prompts should target the same outcome, and the baseline must be solvable without the informed prompt's extra detail.

For more details, see Adding Persona Prompts.

4. Set up infrastructure and seed data

Services run inside the scenario container via supervisord.conf. Init scripts in init/ set up their initial state.

supervisord.conf example
[program:postgres]
command=/usr/lib/postgresql/16/bin/postgres -D /var/lib/postgresql/data
autostart=true
autorestart=false

[program:clickhouse]
command=/usr/bin/clickhouse-server --config-file=/etc/clickhouse-server/config.xml
autostart=true
autorestart=false

Connection strings are exported as environment variables ($POSTGRES_URL, $CLICKHOUSE_URL, etc.) so the agent and assertions can use them directly.
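
From a shell inside the container, a quick sanity check is to query each service through its exported variable. A minimal sketch, assuming $POSTGRES_URL is a standard connection URI and $CLICKHOUSE_URL points at ClickHouse's HTTP interface:

Check the exported connections
psql "$POSTGRES_URL" -c "SELECT 1"                     # psql accepts a connection URI directly
curl -sS "$CLICKHOUSE_URL" --data-binary "SELECT 1"    # POST a query to the HTTP interface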

Seed data using init scripts in init/. These run after services are ready but before the agent starts. They can be SQL files, shell scripts, or anything executable.

init/postgres-setup.sql (from foo-bar-cross-system-reconciliation)
CREATE SCHEMA IF NOT EXISTS app;

CREATE TABLE app.transactions (
  id SERIAL PRIMARY KEY,
  customer_id TEXT NOT NULL,
  amount NUMERIC(10, 2) NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

INSERT INTO app.transactions (customer_id, amount, created_at) VALUES
  ('c001', 10.00, '2026-01-15T09:00:00Z'),
  ('c002', 25.50, '2026-01-15T09:05:00Z'),
  ('c003', 33.33, '2026-01-15T09:10:00Z'),
  ('c004', 47.99, '2026-01-15T09:15:00Z'),
  ('c005', 52.00, '2026-01-15T09:20:00Z');

For broken/incomplete starts, use init scripts to introduce defects (misconfigured connections, missing indexes, schema drift). For clean/greenfield starts, provide healthy infrastructure and realistic source data.
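
As an illustration of a broken start, the ClickHouse query-optimization idea from the beginning of this guide could plant its defect directly in an init script: a table whose ordering key gives analytical queries nothing to prune on. This is a hypothetical sketch; the schema simply reuses the column names from the informed prompt above.

init/clickhouse-setup.sql (hypothetical broken start)
-- Defect on purpose: ORDER BY tuple() provides no useful sort key,
-- so time- or user-scoped queries full-scan and the agent must fix the ordering key.
CREATE DATABASE IF NOT EXISTS analytics;

CREATE TABLE analytics.events
(
    event_id   String,
    event_ts   DateTime,
    user_id    String,
    event_type String,
    value      Float64
)
ENGINE = MergeTree
ORDER BY tuple();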

Seed deterministic data, export connection settings via environment variables, and avoid hidden state that makes runs non-reproducible.

Tools (CLIs, frameworks like Moose or dbt) belong in the harness installScript — not in init/*.sh — so they're baked into the image once instead of reinstalled every run. See Creating a Custom Harness. For comparison scenarios where starting state differs per harness, put harness-specific seed scripts in harnesses/<harness-id>/init/ — they run only when that harness is active.

For more details, see Adding Multiple Services.

5. Write gate assertions

Each scenario has five assertion files in assertions/, one per quality gate. The framework includes core assertions that run for every scenario (e.g. process_exits_clean, no_unhandled_errors). Your files add scenario-specific checks on top of these.

Each assertion returns an AssertionResult with passed, plus optional message and details fields. See assertions in existing scenarios for reference.

assertions/functional.ts
import type { AssertionContext, AssertionResult } from "@dec-bench/eval-core";

export async function target_table_exists(ctx: AssertionContext): Promise<AssertionResult> {
  const result = await ctx.clickhouse.query({
    query: "SELECT count() AS n FROM system.tables WHERE database = 'analytics' AND name = 'events'",
    format: "JSONEachRow",
  });
  const rows = await (result as any).json();
  const count = Number(rows[0]?.n ?? 0);
  return {
    passed: count === 1,
    message: count === 1 ? "Target table exists." : `Expected 1 table, got ${count}.`,
    details: { expected: 1, actual: count },
  };
}

One check per function, actionable failure messages, deterministic results.
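
A second function in the same file could check the loaded data itself rather than just the table's existence. The sketch below is hypothetical: the epoch-fallback heuristic and the analytics.events table are assumptions from the CSV example, not part of the framework.

assertions/functional.ts (continued, hypothetical)
export async function no_unparsed_timestamps(ctx: AssertionContext): Promise<AssertionResult> {
  // Hypothetical data-quality check: rows whose event_ts fell back to the epoch
  // suggest a DD/MM/YYYY date that was never parsed.
  try {
    const result = await ctx.clickhouse.query({
      query: "SELECT count() AS n FROM analytics.events WHERE event_ts = toDateTime(0)",
      format: "JSONEachRow",
    });
    const rows = await (result as any).json();
    const bad = Number(rows[0]?.n ?? 0);
    return {
      passed: bad === 0,
      message: bad === 0 ? "All timestamps parsed." : `${bad} rows have unparsed timestamps.`,
      details: { expected: 0, actual: bad },
    };
  } catch (err) {
    // If the table is missing, the query throws; report that as an actionable failure.
    return { passed: false, message: `Could not query analytics.events: ${err}` };
  }
}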

For more details, see Writing Assertions.

6. Run and validate locally

Test with a single agent/harness/persona combination. The benchmark will run all combinations once the scenario is published.

Run the scenario
dec-bench build --scenario <my-scenario>
dec-bench run --scenario <my-scenario>

Verify the output:

  • Gate-level pass/fail and scores are present
  • Failures are actionable from assertion messages
  • Results are stable across repeated runs
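
A low-tech way to check that last point is to repeat the run a few times and compare the gate results between runs (sketch):

Repeat the run
for i in 1 2 3; do
  dec-bench run --scenario <my-scenario>
done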

Publishing to DEC Bench

At this point your scenario runs locally — you can build, run, and iterate on it independently. If you'd like it included in the DEC Bench reference scenario suite, open a pull request to the repository. First-time contributors will be asked to sign a Contributor License Agreement (CLA) — the bot will prompt you on your PR. Your PR should:

  • Pass validation checks: pnpm --filter @dec-bench/scenarios check-types && pnpm lint
  • Include deterministic seed data (no external dependencies or hidden state)
  • Have prompts that are specific, scoped, and testable
  • Include all five gate assertion files with exported functions
  • Contain no hardcoded secrets or credentials