DOCUMENTATION

Scoring

Five gates, strictly sequential. Efficiency breaks ties.

Scoring is deterministic -- no LLM-as-judge. Five gates, evaluated in order. Fail a gate, and nothing above it counts.

Gates

  Gate  Name        One-liner
  1     Functional  It runs
  2     Correct     It produces right answers
  3     Robust      It handles real-world conditions
  4     Performant  It's fast enough
  5     Production  You'd ship this

Each gate checks two things:

  1. Core assertions -- universal, binary pass/fail (e.g., process_exits_clean, idempotent_rerun)
  2. Scenario assertions -- weighted points defined by the scenario author (e.g., handles_nulls, query_under_200ms)

A gate passes when every core assertion passes and the weighted scenario assertions reach the 80% threshold.
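The pass rule above can be sketched in TypeScript. The types and helper name are hypothetical, not the harness's actual API; only the rule itself (all core assertions pass, weighted scenario assertions reach 80%) comes from the text.

```typescript
// Hypothetical shapes for illustration; the real harness may differ.
interface ScenarioAssertion {
  passed: boolean;
  weight: number; // weight defined by the scenario author
}

const SCENARIO_THRESHOLD = 0.8; // the 80% threshold

function gatePasses(
  core: Record<string, boolean>,
  scenario: Record<string, ScenarioAssertion>,
): boolean {
  // Core assertions are universal and binary: any failure fails the gate.
  if (!Object.values(core).every(Boolean)) return false;

  // Scenario assertions contribute weighted points toward the threshold.
  const assertions = Object.values(scenario);
  const total = assertions.reduce((sum, a) => sum + a.weight, 0);
  if (total === 0) return true; // no scenario assertions defined for this gate
  const earned = assertions
    .filter((a) => a.passed)
    .reduce((sum, a) => sum + a.weight, 0);
  return earned / total >= SCENARIO_THRESHOLD;
}
```

Note that a single failed core assertion fails the gate regardless of how many scenario points were earned.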

Efficiency

Wall clock time, agent steps, tokens used, and LLM API cost are tracked but scored separately. They break ties between agents at the same gate -- they never inflate or deflate the gate score itself.

Output

Two values drive the leaderboard:

  Field             Type          Purpose
  highest_gate      integer 0--5  Primary rank
  normalized_score  float 0--1    Fine-grained ordering within a gate
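Combined with the efficiency tie-break, the full ordering can be sketched as a comparator. Which efficiency metric breaks the final tie (and in what order) is an assumption here; the document only says efficiency breaks ties, never inflates gate scores.

```typescript
// Minimal result shape for ranking; field names follow the example output.
interface EvalResult {
  highest_gate: number;     // 0--5, primary rank
  normalized_score: number; // 0--1, ordering within a gate
  efficiency: { wall_clock_seconds: number };
}

function compareResults(a: EvalResult, b: EvalResult): number {
  // Higher gate wins outright.
  if (a.highest_gate !== b.highest_gate) return b.highest_gate - a.highest_gate;
  // Within a gate, higher normalized_score ranks first.
  if (a.normalized_score !== b.normalized_score)
    return b.normalized_score - a.normalized_score;
  // Dead heat: faster wall clock wins (assumed tie-break metric).
  return a.efficiency.wall_clock_seconds - b.efficiency.wall_clock_seconds;
}
```

Sorting an array of results with this comparator yields the leaderboard order.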

Example Output

Eval result (stdout)
{
  "scenario": "foo-bar-csv-ingest",
  "version": "0.1.0",
  "harness": "base-rt",
  "agent": "claude-code",
  "model": "claude-sonnet-4-20250514",

  "highest_gate": 3,
  "normalized_score": 0.74,

  "gates": {
    "functional": {
      "passed": true,
      "score": 1.00,
      "core": { "process_exits_clean": true, "no_unhandled_errors": true },
      "scenario": { "target_table_exists": true, "table_has_rows": true }
    },
    "correct": {
      "passed": true,
      "score": 0.92,
      "core": {},
      "scenario": { "all_fifteen_events_loaded": true, "no_null_event_ids": true, "dates_are_valid": false }
    },
    "robust": {
      "passed": true,
      "score": 0.86,
      "core": { "idempotent_rerun": true },
      "scenario": { "no_duplicate_header_rows": true, "null_values_handled": true }
    },
    "performant": {
      "passed": false,
      "score": 0.43,
      "core": {},
      "scenario": { "scan_query_under_100ms": false }
    },
    "production": {
      "passed": false,
      "score": 0.50,
      "core": { "uses_env_vars": true, "no_secrets_in_code": true },
      "scenario": { "connection_env_vars_available": true, "no_temporary_tables": false }
    }
  },

  "efficiency": {
    "wall_clock_seconds": 127,
    "agent_steps": 14,
    "tokens_used": 48230,
    "llm_api_cost_usd": 0.34
  }
}

This agent cleared gates 1--3 and failed gate 4 on performance, so highest_gate is 3. The normalized_score of 0.74 reflects the weighted scenario assertions across all gates.
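As a sanity check on the sample output only: the five per-gate scores above happen to average to 0.74. This is an observation about the example, not a documented formula for normalized_score.

```typescript
// Per-gate scores from the example output above.
const gateScores = [1.0, 0.92, 0.86, 0.43, 0.5];

const mean = gateScores.reduce((sum, s) => sum + s, 0) / gateScores.length;
console.log(Math.round(mean * 100) / 100); // 0.74, matching the example's normalized_score
```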

Assertion Log Sidecar

Rich per-assertion runtime logs are stored in a separate sidecar file, not in the main result JSON. This keeps scoring output lean while preserving audit evidence.

  • Sidecar file name: .assertion-log.json (copied into audit bundles as logs/assertion-log.json)
  • Manifest log entry id: assertion_log
  • Shape: gate -> { core, scenario } -> assertion name -> { passed, durationMs, message?, error?, details? }
assertion-log.json
{
  "functional": {
    "core": {
      "process_exits_clean": {
        "passed": true,
        "durationMs": 0,
        "message": "Agent process exited cleanly.",
        "details": { "exitCode": 0 }
      }
    },
    "scenario": {
      "target_table_exists": {
        "passed": true,
        "durationMs": 45,
        "message": "Target table exists.",
        "details": { "expected": 1, "actual": 1 }
      }
    }
  },
  "correct": { "core": {}, "scenario": {} },
  "robust": { "core": {}, "scenario": {} },
  "performant": { "core": {}, "scenario": {} },
  "production": { "core": {}, "scenario": {} }
}

The audit rubric panel uses this sidecar to show assertion-specific evidence (duration, message, thrown error, details JSON) for each assertion row.
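The sidecar shape described above can be captured as TypeScript types. The type names are invented for illustration; the fields follow the documented shape (gate -> { core, scenario } -> assertion name -> entry), with message, error, and details optional.

```typescript
// One assertion's runtime evidence, per the documented entry shape.
interface AssertionEntry {
  passed: boolean;
  durationMs: number;
  message?: string;
  error?: string;
  details?: unknown; // free-form details JSON shown in the audit panel
}

// Each gate splits its log into core and scenario assertion maps.
interface GateLog {
  core: Record<string, AssertionEntry>;
  scenario: Record<string, AssertionEntry>;
}

// The sidecar keys every gate by name, matching the main result JSON.
type AssertionLog = Record<
  "functional" | "correct" | "robust" | "performant" | "production",
  GateLog
>;
```

A consumer would `JSON.parse` the sidecar file and treat the result as an `AssertionLog` to render per-assertion evidence.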