
Architecture

Layered Docker images, harness scripts, and network policy.

Every benchmark image stacks four layers. Because Docker caches shared layers, running ten agent/model combinations against the same scenario downloads the scenario layer only once.

Layer hierarchy
base                          OS, runtimes, database CLIs, scoring framework
  └── scenario                infrastructure, seed data, prompt, validation
        └── harness           pre-installed tools
              └── agent       agent CLI + runner script + model config
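
The sharing can be illustrated with a small sketch. The `bench/...` tag scheme and every scenario, agent, and model name below are hypothetical; the point is only that combinations built on the same scenario resolve to one scenario layer.

```shell
# Hypothetical tag scheme, for illustration only; real image names may differ.
# Prints the four-layer chain for one scenario/harness/agent/model combination.
image_chain() {
  local scenario="$1" harness="$2" agent="$3" model="$4"
  printf '%s\n' \
    "bench/base" \
    "bench/scenario:${scenario}" \
    "bench/harness:${scenario}-${harness}" \
    "bench/agent:${scenario}-${harness}-${agent}-${model}"
}

# Counts the distinct scenario layers needed by a set of "agent/model" combos.
distinct_scenario_layers() {
  local scenario="$1" harness="$2" combo
  shift 2
  for combo in "$@"; do
    # Split "agent/model" at the first slash
    image_chain "$scenario" "$harness" "${combo%%/*}" "${combo#*/}"
  done | grep '^bench/scenario:' | sort -u | wc -l | tr -d ' '
}
```

Calling `distinct_scenario_layers broken-pipeline classic-de agent-a/model-1 agent-b/model-2 agent-c/model-3` prints `1`: three combinations, one scenario layer.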

Base

Built on debian:bookworm-slim (Alpine's musl libc breaks too many tools).

Includes:

  • Runtimes: Python 3, pip, venv, Node.js 22
  • CLIs: bash, curl, git, jq, psql, mysql-client, clickhouse-client, rpk
  • Data infrastructure: Postgres 16, ClickHouse, Redpanda (streaming)
  • Python packages: psycopg2-binary, clickhouse-connect, kafka-python, sqlalchemy, pandas
  • Process management: supervisord for running services inside the container
  • Scoring: gate assertion framework, session recording, efficiency metrics
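
A quick way to confirm that the CLIs listed above are present in a built image is a PATH check. This is a minimal sketch, not part of the benchmark itself; the binary names are assumed from the list (the mysql-client package installs a `mysql` binary).

```shell
# Prints each tool from the argument list that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Inside the container, `missing_tools bash curl git jq psql mysql clickhouse-client rpk` should print nothing if the base layer is intact.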

Scenario

A self-contained data engineering problem. The container boots infrastructure in a broken or incomplete state. The agent fixes it.

Postgres, ClickHouse, and Redpanda all run locally inside the container under supervisord. Connection strings are exported as environment variables.
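
An agent or validation script can check that the connection strings are present before touching any service. The sketch below assumes nothing about the real variable names; `POSTGRES_URL`, `CLICKHOUSE_URL`, and `REDPANDA_BROKERS` in the usage note are hypothetical placeholders.

```shell
# Prints the name of every listed variable that is unset or empty.
# Variable names are supplied by the caller; none are defined by this sketch.
unset_vars() {
  local name value
  for name in "$@"; do
    # Indirect lookup that works in POSIX sh as well as bash
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}
```

For example, `unset_vars POSTGRES_URL CLICKHOUSE_URL REDPANDA_BROKERS` prints the names that still need to be exported, or nothing if all three are set.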

Harness

A shell script that installs tools. That's the whole abstraction.

harnesses/classic-de.json installScript
apt-get update && apt-get install -y --no-install-recommends openjdk-17-jre-headless && rm -rf /var/lib/apt/lists/*
pip3 install --no-cache-dir --break-system-packages dbt-core==1.10.19 dbt-postgres==1.10.0 dbt-clickhouse==1.10.0 apache-airflow==2.10.5 pyspark==3.5.5

The Dockerfile runs it at build time. Contributing a harness means writing a shell script.

Agent + Model

Agent-specific runner scripts are baked into the image (claude-code, codex, and cursor):

agents/claude-code/run.sh
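# Select the prompt for the chosen persona (default: naive)
PROMPT_FILE="/scenario/prompts/${PERSONA:-naive}.md"
# Run Claude Code non-interactively with the prompt, capped at 50 turns
claude -p "$(cat "$PROMPT_FILE")" \
  --model "${MODEL:-claude-sonnet-4-20250514}" \
  --allowedTools "Bash(command:*)" \
  --max-turns 50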

Custom agents mount or bake in their own runner. See Running Evals.
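
A custom runner can follow the same shape as the Claude Code script. The sketch below is hypothetical: `my-agent`, its flags, and the default model name are placeholders rather than a real CLI, and only the `PERSONA` and `MODEL` conventions are taken from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical runner for a custom agent, mirroring agents/claude-code/run.sh.

# Resolve the prompt for the selected persona (defaults to "naive").
prompt_file() {
  printf '/scenario/prompts/%s.md\n' "${PERSONA:-naive}"
}

# Fail early if the prompt is missing, then hand it to the agent CLI.
# `my-agent` and its flags are placeholders, not a real tool.
run_agent() {
  local file
  file="$(prompt_file)"
  if [ ! -r "$file" ]; then
    echo "prompt not found: $file" >&2
    return 1
  fi
  my-agent --prompt "$(cat "$file")" --model "${MODEL:-my-default-model}"
}
```

The runner's last line would then be a plain `run_agent` call, so a missing prompt fails the run instead of sending an empty prompt to the model.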

Network Access

Outbound traffic is open by default. The agent can install packages, call APIs, and search the web -- just like a developer on a real machine.

Pass --network-policy=restricted for stricter evaluations. In restricted mode, all outbound traffic is blocked except traffic to explicitly allowlisted endpoints:

Restricted mode: base rules
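# Keep loopback open so the local services stay reachable
iptables -A OUTPUT -o lo -j ACCEPT
# Drop all other outbound traffic by default
iptables -P OUTPUT DROP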

Each agent adds its LLM provider endpoints:

agents/claude-code/iptables.sh
iptables -I OUTPUT -d api.anthropic.com -j ACCEPT

agents/codex/iptables.sh
iptables -I OUTPUT -d api.openai.com -j ACCEPT

agents/cursor/iptables.sh
iptables -I OUTPUT -d api.cursor.sh -j ACCEPT
iptables -I OUTPUT -d api2.cursor.sh -j ACCEPT

Each harness may add endpoints for its tools when running in restricted mode.

Rules stack. The final image allows the union of all endpoints from its agent and harness layers.
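
The stacking can be previewed with a dry run that prints the resulting commands instead of executing them. This is a sketch under stated assumptions, not the project's actual tooling: the function name is hypothetical, and `pypi.org` in the usage note stands in for whatever endpoint a harness might add. The sketch sets the DROP policy last, an ordering assumed here because iptables resolves a hostname passed to -d once, when the rule is added.

```shell
# Dry run: prints the iptables commands for restricted mode, runs none of them.
# Arguments are the endpoint hostnames contributed by the agent and harness layers.
restricted_rules() {
  echo "iptables -A OUTPUT -o lo -j ACCEPT"
  # sort -u yields the union: an endpoint listed by both layers appears once
  printf '%s\n' "$@" | sort -u | while read -r host; do
    if [ -n "$host" ]; then
      echo "iptables -I OUTPUT -d $host -j ACCEPT"
    fi
  done
  # Default-deny goes last so the hostname lookups above still have network access
  echo "iptables -P OUTPUT DROP"
}
```

Running `restricted_rules api.anthropic.com pypi.org api.anthropic.com` prints four commands: the loopback rule, one ACCEPT per distinct endpoint, and the DROP policy.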