Overview
One command. Real infrastructure. JSON results.
Run a benchmark:
    docker run \
      -e ANTHROPIC_API_KEY=sk-ant-... \
      ghcr.io/514-labs/dec-bench:foo-bar-csv-ingest.base-rt.claude-code.sonnet-4.v0.1.0

JSON hits stdout. Nothing else to install, configure, or deploy.
How It Works
Every image tag encodes four variables, followed by a release version:

    {scenario}.{harness}.{agent}.{model}.{version}

| Variable | What it is | Examples |
|---|---|---|
| Scenario | A data engineering problem with infrastructure, seed data, a prompt, and validation | foo-bar-csv-ingest, foo-bar-time-grain-rollups, foo-bar-cross-system-reconciliation |
| Harness | Tools pre-installed for the agent | base-rt, classic-de, olap-for-swe |
| Agent | The AI coding agent | Claude Code (v0.1), with Codex and Aider coming next |
| Model | The LLM behind the agent | claude-sonnet-4-20250514, haiku-4.5 |
Each combination builds a distinct, immutable container. No runtime flags. The image tag is the configuration.
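Because the tag carries the whole configuration, it can be split back into its fields. The following is a minimal sketch in POSIX shell, not part of the project's tooling. It assumes the first four fields contain no dots, which holds for the example tag above; a model name such as haiku-4.5 would need different handling. The variable names are illustrative.

```shell
# Split a tag of the form {scenario}.{harness}.{agent}.{model}.{version}.
# Assumption: scenario, harness, agent, and model contain no dots.
tag="foo-bar-csv-ingest.base-rt.claude-code.sonnet-4.v0.1.0"

# Peel one field off the front at a time using parameter expansion.
scenario=${tag%%.*};  rest=${tag#*.}
harness=${rest%%.*};  rest=${rest#*.}
agent=${rest%%.*};    rest=${rest#*.}
model=${rest%%.*}
version=${rest#*.}    # everything after the fourth dot, so "v0.1.0" stays whole

printf 'scenario=%s\nharness=%s\nagent=%s\nmodel=%s\nversion=%s\n' \
  "$scenario" "$harness" "$agent" "$model" "$version"
```

Taking the version as the remainder, instead of splitting on every dot, keeps the dots inside the version string intact.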
v0.1 Release
The v0.1 release includes 36 Foo Bar scenarios (13 tier-1, 18 tier-2, 5 tier-3), 3 harness configurations (Base RT, Classic DE, OLAP for SWE), and Claude Code as the primary agent.
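Since each scenario and harness pairing maps to its own image, a slice of the release can be enumerated by composing tags. The sketch below is illustrative only: it uses the three example scenarios and the three harness identifiers listed earlier, and it assumes the agent, model, and version fields from the example tag (claude-code, sonnet-4, v0.1.0). Whether every resulting image is actually published is an assumption, not something this page confirms.

```shell
# Compose image references for a small scenario x harness matrix.
# Assumption: agent, model, and version match the example tag above.
registry="ghcr.io/514-labs/dec-bench"
agent="claude-code"
model="sonnet-4"
version="v0.1.0"

refs=""
for scenario in foo-bar-csv-ingest foo-bar-time-grain-rollups foo-bar-cross-system-reconciliation; do
  for harness in base-rt classic-de olap-for-swe; do
    # One image reference per line.
    refs="${refs}${registry}:${scenario}.${harness}.${agent}.${model}.${version}
"
  done
done

printf '%s' "$refs"
```

Each printed line could be passed to docker run in place of the image reference shown in the overview.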
Next Steps
- Architecture -- how images are layered and built
- Scoring -- 5 sequential gates, efficiency tiebreakers
- Running Evals -- API keys, build matrix, custom agents
- Foo Bar Domain -- full 36-scenario catalog
- Add an Eval -- contribute a new scenario