# Scoring
Five gates, strictly sequential. Efficiency breaks ties.
Scoring is deterministic -- no LLM-as-judge. Five gates, evaluated in order. Fail a gate, and nothing above it counts.
## Gates
| Gate | Name | One-liner |
|---|---|---|
| 1 | Functional | It runs |
| 2 | Correct | It produces right answers |
| 3 | Robust | It handles real-world conditions |
| 4 | Performant | It's fast enough |
| 5 | Production | You'd ship this |
Each gate checks two things:
- Core assertions -- universal, binary pass/fail (e.g., `process_exits_clean`, `idempotent_rerun`)
- Scenario assertions -- weighted points defined by the scenario author (e.g., `handles_nulls`, `query_under_200ms`)
A gate passes when every core assertion passes and the weighted scenario assertions reach the 80% threshold.
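The gate rule above can be sketched as follows. This is a hypothetical helper, not the harness's real API: the document does not show how scenario weights are declared, so they are modeled here as `(passed, weight)` pairs.

```python
def gate_passes(core: dict[str, bool],
                scenario: dict[str, tuple[bool, float]],
                threshold: float = 0.80) -> bool:
    """Sketch of the gate rule: all core assertions must pass, and the
    weighted scenario assertions must reach the threshold (assumption:
    a gate with no scenario assertions passes on core alone)."""
    if not all(core.values()):
        return False
    total = sum(weight for _, weight in scenario.values())
    earned = sum(weight for passed, weight in scenario.values() if passed)
    return total == 0 or earned / total >= threshold

# Example: both core assertions pass; 4 of 5 weighted points earned (0.80).
core = {"process_exits_clean": True, "idempotent_rerun": True}
scenario = {
    "handles_nulls": (True, 3.0),
    "query_under_200ms": (False, 1.0),
    "rows_loaded": (True, 1.0),
}
print(gate_passes(core, scenario))  # → True
```

Note that a single failed core assertion fails the gate outright, regardless of how many scenario points were earned.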
## Efficiency
Wall clock time, agent steps, tokens used, and LLM API cost are tracked but scored separately. They break ties between agents at the same gate -- they never inflate or deflate the gate score itself.
## Output
Two values drive the leaderboard:
| Field | Type | Purpose |
|---|---|---|
| `highest_gate` | integer 0--5 | Primary rank |
| `normalized_score` | float 0--1 | Fine-grained ordering within a gate |
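Taken together, the two fields imply a leaderboard ordering like the sketch below. The `Entry` type is invented for illustration, and using `wall_clock_seconds` as the tie-break metric is an assumption: the document lists four efficiency metrics without specifying which breaks ties first.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    agent: str
    highest_gate: int        # 0--5, primary rank
    normalized_score: float  # 0--1, ordering within a gate
    wall_clock_seconds: int  # efficiency tie-break; never alters the score

def leaderboard(entries: list[Entry]) -> list[Entry]:
    # Higher gate first, then higher score, then faster wall clock.
    return sorted(entries, key=lambda e: (-e.highest_gate,
                                          -e.normalized_score,
                                          e.wall_clock_seconds))

rows = [
    Entry("a", 3, 0.74, 127),
    Entry("b", 3, 0.74, 90),
    Entry("c", 4, 0.51, 300),
]
print([e.agent for e in leaderboard(rows)])  # → ['c', 'b', 'a']
```

Agent `c` ranks first despite a lower score because gate rank dominates; `b` beats `a` only on the efficiency tie-break.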
### Example Output
```json
{
  "scenario": "foo-bar-csv-ingest",
  "version": "0.1.0",
  "harness": "base-rt",
  "agent": "claude-code",
  "model": "claude-sonnet-4-20250514",
  "highest_gate": 3,
  "normalized_score": 0.74,
  "gates": {
    "functional": {
      "passed": true,
      "score": 1.00,
      "core": { "process_exits_clean": true, "no_unhandled_errors": true },
      "scenario": { "target_table_exists": true, "table_has_rows": true }
    },
    "correct": {
      "passed": true,
      "score": 0.92,
      "core": {},
      "scenario": { "all_fifteen_events_loaded": true, "no_null_event_ids": true, "dates_are_valid": false }
    },
    "robust": {
      "passed": true,
      "score": 0.86,
      "core": { "idempotent_rerun": true },
      "scenario": { "no_duplicate_header_rows": true, "null_values_handled": true }
    },
    "performant": {
      "passed": false,
      "score": 0.43,
      "core": {},
      "scenario": { "scan_query_under_100ms": false }
    },
    "production": {
      "passed": false,
      "score": 0.50,
      "core": { "uses_env_vars": true, "no_secrets_in_code": true },
      "scenario": { "connection_env_vars_available": true, "no_temporary_tables": false }
    }
  },
  "efficiency": {
    "wall_clock_seconds": 127,
    "agent_steps": 14,
    "tokens_used": 48230,
    "llm_api_cost_usd": 0.34
  }
}
```

This agent cleared gates 1--3 and failed gate 4 on performance, so `highest_gate` is 3. The `normalized_score` of 0.74 reflects the weighted scenario assertions across all gates.
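Because the gates are strictly sequential, `highest_gate` can be derived from the `gates` object with a simple scan that stops at the first failure. A sketch with an invented helper name:

```python
GATE_ORDER = ["functional", "correct", "robust", "performant", "production"]

def highest_gate(gates: dict) -> int:
    """Count consecutive passed gates from the start; stop at the
    first failure (gates are strictly sequential)."""
    n = 0
    for name in GATE_ORDER:
        if not gates.get(name, {}).get("passed", False):
            break
        n += 1
    return n

# Mirrors the example result: gates 1-3 passed, gate 4 failed.
gates = {
    "functional": {"passed": True},
    "correct": {"passed": True},
    "robust": {"passed": True},
    "performant": {"passed": False},
    "production": {"passed": False},
}
print(highest_gate(gates))  # → 3
```

Note that a passed gate 5 after a failed gate 4 would still yield 3: nothing above a failed gate counts.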
## Assertion Log Sidecar
Rich per-assertion runtime logs are stored in a separate sidecar file, not in the main result JSON. This keeps scoring output lean while preserving audit evidence.
- Sidecar file name: `.assertion-log.json` (copied into audit bundles as `logs/assertion-log.json`)
- Manifest log entry id: `assertion_log`
- Shape: gate -> `{ core, scenario }` -> assertion name -> `{ passed, durationMs, message?, error?, details? }`
```json
{
  "functional": {
    "core": {
      "process_exits_clean": {
        "passed": true,
        "durationMs": 0,
        "message": "Agent process exited cleanly.",
        "details": { "exitCode": 0 }
      }
    },
    "scenario": {
      "target_table_exists": {
        "passed": true,
        "durationMs": 45,
        "message": "Target table exists.",
        "details": { "expected": 1, "actual": 1 }
      }
    }
  },
  "correct": { "core": {}, "scenario": {} },
  "robust": { "core": {}, "scenario": {} },
  "performant": { "core": {}, "scenario": {} },
  "production": { "core": {}, "scenario": {} }
}
```

The audit rubric panel uses this sidecar to show assertion-specific evidence (duration, message, thrown error, details JSON) for each assertion row.
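A consumer such as the audit rubric panel can flatten the sidecar into per-assertion rows. A minimal sketch, assuming only the documented shape (the `assertion_rows` helper is invented for illustration):

```python
def assertion_rows(log: dict):
    """Flatten gate -> {core, scenario} -> name -> record into flat
    rows suitable for rendering one line per assertion."""
    for gate, kinds in log.items():
        for kind in ("core", "scenario"):
            for name, rec in kinds.get(kind, {}).items():
                yield (gate, kind, name, rec["passed"],
                       rec.get("durationMs"), rec.get("message"))

# One populated assertion, matching the sidecar example.
log = {
    "functional": {
        "core": {
            "process_exits_clean": {
                "passed": True,
                "durationMs": 0,
                "message": "Agent process exited cleanly.",
            }
        },
        "scenario": {},
    }
}
for row in assertion_rows(log):
    print(row)
```

Optional fields (`message`, `error`, `details`) are read with `.get()` so rows render even when an assertion logged no extra evidence.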