
Difficulty Tiers

How to scope and classify DEC Bench scenarios by complexity.

Every scenario is assigned a difficulty tier that reflects the scope of infrastructure, the number of tasks, and the depth of reasoning required from the agent.

Tier Definitions

| Tier | Scope | Services | Assertions | Example |
| --- | --- | --- | --- | --- |
| tier-1 | Single service, one focused task, minimal moving parts | 1 | 3--5 per gate | Fix a broken Postgres connection; create one dbt model from a single source table |
| tier-2 | Multiple tasks or services, moderate debugging or design decisions | 1--2 | 5--10 per gate | Build an ingestion pipeline with schema validation and idempotent reruns |
| tier-3 | Cross-service orchestration, performance tuning, production-grade constraints | 2--3+ | 10+ per gate | End-to-end ELT from Postgres through Redpanda to ClickHouse with latency targets and incremental loads |
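
The ranges in the table can be encoded as data and used as a quick sanity check while authoring a scenario. The following is a minimal sketch, not part of DEC Bench itself: `TierGuideline`, `TIER_GUIDELINES`, and `fits_tier` are hypothetical names, and the open-ended "3+" and "10+" entries are modeled as having no upper bound.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TierGuideline:
    """Hypothetical encoding of one row of the tier table."""
    min_services: int
    max_services: Optional[int]    # None means open-ended, e.g. "2--3+"
    min_assertions: int
    max_assertions: Optional[int]  # None means open-ended, e.g. "10+"


# Values copied from the Services and Assertions columns of the table.
TIER_GUIDELINES = {
    "tier-1": TierGuideline(1, 1, 3, 5),
    "tier-2": TierGuideline(1, 2, 5, 10),
    "tier-3": TierGuideline(2, None, 10, None),
}


def _within(value: int, low: int, high: Optional[int]) -> bool:
    """True if value lies in [low, high]; high=None means no upper bound."""
    return value >= low and (high is None or value <= high)


def fits_tier(tier: str, services: int, assertions_per_gate: List[int]) -> bool:
    """Check a scenario's service count and per-gate assertion counts
    against the guideline ranges for the given tier."""
    g = TIER_GUIDELINES[tier]
    return _within(services, g.min_services, g.max_services) and all(
        _within(n, g.min_assertions, g.max_assertions)
        for n in assertions_per_gate
    )
```

For example, `fits_tier("tier-1", 1, [3, 4, 5, 3, 5])` holds, while the same assertion counts with two services do not fit tier-1. The ranges are guidelines, so a mismatch is a prompt to reconsider the tier rather than a hard error.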

Choosing a Tier

Start with tier-1 for your first eval. Higher tiers require more infrastructure setup, more seed data, and more assertions to cover the expanded surface area.

Tier-1 scenarios are good for isolating a single competency in a controlled environment. They are fast to author, fast to run, and easy to debug when assertions fail.

Tier-2 scenarios introduce realistic complexity: multiple tasks, cross-table dependencies, or light multi-service coordination. Most production-relevant evals land here.

Tier-3 scenarios test end-to-end system design under real constraints. They are expensive to author and run, but they surface the kind of reasoning that separates strong agents from weak ones.

Tier and Gate Interaction

Tier does not change how gates work. All five gates apply at every tier. What changes is the number and depth of scenario assertions within each gate:

  • A tier-1 functional gate might check that one table exists.
  • A tier-3 functional gate might check that three services are healthy, five tables exist, and a streaming topic has consumers attached.
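
The contrast between the two bullets above can be sketched in code. This is an illustration under stated assumptions, not the DEC Bench assertion API: the `env` snapshot, its keys, and the table, service, and topic names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical snapshot of the scenario environment after the agent has run.
env = {
    "service_health": {"postgres": True, "redpanda": True, "clickhouse": True},
    "tables": {"orders", "customers", "payments", "events", "daily_revenue"},
    "topic_consumers": {"orders_cdc": 2},
}


def tier1_functional_checks(env: dict) -> Dict[str, bool]:
    """Tier-1 style: a single narrow check that one table exists."""
    return {"table_exists:orders": "orders" in env["tables"]}


def tier3_functional_checks(
    env: dict, services: List[str], tables: List[str], topic: str
) -> Dict[str, bool]:
    """Tier-3 style: service health, several tables, and a streaming
    topic with at least one consumer attached."""
    checks: Dict[str, bool] = {}
    for name in services:
        checks[f"service_healthy:{name}"] = env["service_health"].get(name, False)
    for name in tables:
        checks[f"table_exists:{name}"] = name in env["tables"]
    checks[f"topic_has_consumers:{topic}"] = (
        env["topic_consumers"].get(topic, 0) > 0
    )
    return checks


def gate_passes(checks: Dict[str, bool]) -> bool:
    """The gate passes only if every individual check passes."""
    return all(checks.values())
```

The gate logic (`gate_passes`) is identical in both cases; only the number and depth of the checks fed into it differ, which matches the point that tier does not change how gates work.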