Architecture
Layered Docker images, harness scripts, and network policy.
Every benchmark image stacks four layers. Because Docker caches shared layers, running ten agent/model combos against the same scenario downloads the scenario layer once.
base          OS, runtimes, database CLIs, scoring framework
└── scenario  infrastructure, seed data, prompt, validation
    └── harness  pre-installed tools
        └── agent  agent CLI + runner script + model config
Base
Built on debian:bookworm-slim (Alpine's musl breaks too many tools).
Includes:
- Runtimes: Python 3, pip, venv, Node.js 22
- CLIs: bash, curl, git, jq, psql, mysql-client, clickhouse-client, rpk
- Data infrastructure: Postgres 16, ClickHouse, Redpanda (streaming)
- Python packages: psycopg2-binary, clickhouse-connect, kafka-python, sqlalchemy, pandas
- Infrastructure: supervisord for running services inside the container
- Scoring: gate assertion framework, session recording, efficiency metrics
Scenario
A self-contained data engineering problem. The container boots infrastructure in a broken or incomplete state. The agent fixes it.
Services run locally via supervisord -- Postgres, ClickHouse, Redpanda, all inside the container. Connection strings are exported as environment variables.
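Since connection strings arrive as environment variables, a runner or scenario script might check that they are present before doing any work. The following is a minimal sketch only: the variable names `POSTGRES_URL`, `CLICKHOUSE_URL`, and `REDPANDA_BROKERS` are hypothetical placeholders, not names the benchmark is known to export.

```shell
#!/usr/bin/env bash
# Sketch: report which expected environment variables are unset or empty.
# The variable names passed in are assumptions; substitute the names the
# scenario actually exports.

# missing_env NAME... : print each missing variable name, one per line.
# Returns 1 if any variable is missing, 0 otherwise.
missing_env() {
  local name status=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child
    # process (such as an agent CLI) would see as well.
    if [ -z "$(printenv "$name")" ]; then
      printf '%s\n' "$name"
      status=1
    fi
  done
  return "$status"
}
```

A script could call `missing_env POSTGRES_URL CLICKHOUSE_URL REDPANDA_BROKERS` and exit early if it returns non-zero, which fails fast instead of surfacing a confusing connection error later.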
Harness
A shell script that installs tools. That's the whole abstraction.
apt-get update && apt-get install -y --no-install-recommends openjdk-17-jre-headless && rm -rf /var/lib/apt/lists/*
pip3 install --no-cache-dir --break-system-packages dbt-core==1.10.19 dbt-postgres==1.10.0 dbt-clickhouse==1.10.0 apache-airflow==2.10.5 pyspark==3.5.5
The Dockerfile runs it at build time. Contributing a harness = writing a shell script.
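To show where the harness script sits in the four-layer stack, here is a dry-run sketch that only prints the `docker build` commands in the order they would run. The tag names, directory layout, and the `PARENT` build argument are all hypothetical; the project's real build tooling may look quite different.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print, in order, the docker build commands for one stack.
# Tag names, directory paths, and the PARENT build arg are hypothetical.

# plan_builds SCENARIO HARNESS AGENT
plan_builds() {
  local scenario="$1" harness="$2" agent="$3"
  local base_tag="bench-base:latest"
  local scenario_tag="bench-${scenario}:latest"
  local harness_tag="bench-${scenario}-${harness}:latest"
  local agent_tag="bench-${scenario}-${harness}-${agent}:latest"

  # Each layer names the previous one as its parent image.
  echo "docker build -t ${base_tag} base"
  echo "docker build -t ${scenario_tag} --build-arg PARENT=${base_tag} scenarios/${scenario}"
  echo "docker build -t ${harness_tag} --build-arg PARENT=${scenario_tag} harnesses/${harness}"
  echo "docker build -t ${agent_tag} --build-arg PARENT=${harness_tag} agents/${agent}"
}
```

Under these assumptions, changing only the agent argument changes only the last printed command, so the base, scenario, and harness layers would be served from cache.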
Agent + Model
Agent-specific runner scripts are baked into the image (claude-code, codex, and cursor):
PROMPT_FILE="/scenario/prompts/${PERSONA:-naive}.md"
claude -p "$(cat "$PROMPT_FILE")" \
  --model "${MODEL:-claude-sonnet-4-20250514}" \
  --allowedTools "Bash(command:*)" \
  --max-turns 50
Custom agents mount or bake in their own runner. See Running Evals.
Network Access
Outbound traffic is open by default. The agent can install packages, call APIs, and search the web -- just like a developer on a real machine.
Pass --network-policy=restricted for stricter evaluations. In restricted mode, all outbound traffic is blocked except loopback and explicitly allowlisted endpoints:
iptables -A OUTPUT -o lo -j ACCEPT
iptables -P OUTPUT DROP
Each agent adds its LLM provider endpoints:
iptables -I OUTPUT -d api.anthropic.com -j ACCEPT
iptables -I OUTPUT -d api.openai.com -j ACCEPT
iptables -I OUTPUT -d api.cursor.sh -j ACCEPT
iptables -I OUTPUT -d api2.cursor.sh -j ACCEPT
Each harness may add endpoints for its tools when running in restricted mode.
Rules stack. The final image allows the union of all endpoints from its agent and harness layers.
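One way to picture the union is as a merge of per-layer allowlist files. The sketch below assumes each layer ships a plain-text file with one hostname per line; that file format and both function names are hypothetical. It prints the rules rather than applying them, so it can run without root.

```shell
#!/usr/bin/env bash
# Sketch: merge per-layer allowlists and print restricted-mode rules.
# The one-hostname-per-line file format is an assumption for illustration.

# union_endpoints FILE... : merge allowlist files, dropping comments,
# blank lines, and duplicate hostnames.
union_endpoints() {
  cat "$@" | sed -e 's/#.*//' -e 's/[[:space:]]//g' | grep -v '^$' | sort -u
}

# emit_rules : read hostnames on stdin and print the iptables commands
# (dry run; nothing is applied).
emit_rules() {
  echo "iptables -A OUTPUT -o lo -j ACCEPT"
  echo "iptables -P OUTPUT DROP"
  local host
  while IFS= read -r host; do
    echo "iptables -I OUTPUT -d ${host} -j ACCEPT"
  done
}
```

For example, `union_endpoints agent.txt harness.txt | emit_rules` would print one ACCEPT rule per distinct hostname. Note that iptables resolves a hostname passed to `-d` once, when the rule is inserted, so endpoints whose IP addresses change may need the rules refreshed.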