Calibrating and Maintaining Scenarios
Review results, adjust thresholds, and version changes to keep scenarios reliable over time.
A scenario is not finished when it merges. Reviewing results across runs can reveal problems with prompts, assertions, or seed data that were not obvious during authoring.
Reviewing results
After a scenario has been run across multiple agents and models, look for patterns in the results:
- Failure clusters: if every agent fails the same assertion, the assertion may be too strict or the prompt may be missing critical context.
- Universal passes: if every agent passes every gate with both personas, the assertions may not be testing enough. Review whether the checks actually exercise the behavior the scenario is meant to measure.
- Flaky results: individual runs vary because LLM output is generative; that is expected. But if results for the same agent/model combination are inconsistent across many runs, the scenario may have avoidable non-determinism. Check seed data, service startup order, and assertion timing (see the sketch after this list).
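One way to make the flakiness check concrete is to aggregate pass/fail per agent/model/assertion combination across runs and flag any combination whose pass rate sits in a middle band. This is a minimal sketch, not part of any harness: the `results.json` layout, the field names, and the 20-80% band are all illustrative assumptions.

```python
# Sketch: flag flaky agent/model/assertion combinations across runs.
# Assumes a hypothetical results.json layout; adjust field names to
# match whatever your harness actually emits.
import json
from collections import defaultdict

FLAKY_LOW, FLAKY_HIGH = 0.2, 0.8  # pass rates in this band suggest flakiness

def load_runs(path="results.json"):
    with open(path) as f:
        # Expected shape: list of {"agent", "model", "assertions": {name: bool}}
        return json.load(f)

def flaky_combinations(runs):
    # (agent, model, assertion) -> [passes, total]
    tallies = defaultdict(lambda: [0, 0])
    for run in runs:
        for name, passed in run["assertions"].items():
            key = (run["agent"], run["model"], name)
            tallies[key][0] += int(passed)
            tallies[key][1] += 1
    for key, (passes, total) in sorted(tallies.items()):
        rate = passes / total
        # Require enough runs before calling anything flaky.
        if total >= 5 and FLAKY_LOW < rate < FLAKY_HIGH:
            print(f"{key}: {passes}/{total} passes ({rate:.0%}) - investigate non-determinism")

if __name__ == "__main__":
    flaky_combinations(load_runs())
```

Combinations near 0% or 100% are consistent; the ones in the middle band are where seed data, startup order, or assertion timing is worth a closer look.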
Adjusting thresholds
Performance and production gate assertions often include numeric thresholds (query latency, row counts, timing budgets). When adjusting these:
- Document the rationale for the change. "Relaxed from 100ms to 200ms because baseline hardware varies" is traceable. A silent threshold change is not.
- Test the new threshold against existing results before merging. A threshold change that retroactively flips pass/fail for published runs affects leaderboard interpretation; a sketch of this check follows the list.
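Before merging a threshold change, it helps to replay it against stored metrics and list every run whose verdict would flip. A minimal sketch, assuming a hypothetical `metrics.json` file and `p95_latency_ms` metric name; neither is prescribed by the scenario format:

```python
# Sketch: check whether a proposed threshold retroactively flips
# pass/fail for published runs. The metrics.json layout and the
# p95_latency_ms metric name are illustrative assumptions.
import json

OLD_THRESHOLD_MS = 100
NEW_THRESHOLD_MS = 200  # rationale: baseline hardware varies across runners

def report_flips(path="metrics.json"):
    with open(path) as f:
        # Expected shape: list of {"run_id", "p95_latency_ms"}
        runs = json.load(f)
    for run in runs:
        latency = run["p95_latency_ms"]
        old_pass = latency <= OLD_THRESHOLD_MS
        new_pass = latency <= NEW_THRESHOLD_MS
        if old_pass != new_pass:
            print(f"run {run['run_id']}: {latency}ms flips "
                  f"{'fail->pass' if new_pass else 'pass->fail'}")

if __name__ == "__main__":
    report_flips()
```

An empty report means the change is safe for published runs; any flips should be called out alongside the rationale in the PR.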
Breaking changes
Any change to a scenario that can alter its scoring output is a breaking change:
- Modified assertions (new checks, changed thresholds, removed checks)
- Changed seed data or init scripts
- Updated prompts
When making breaking changes, note them clearly in the PR description. Results from before and after a breaking change are not directly comparable.
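One way to keep the before/after boundary machine-checkable is to carry a scenario version in every result and bump it on any breaking change. A minimal sketch, assuming a hypothetical `scenario_version` field in run records; the field name and semantic-version scheme are illustrative, not a requirement:

```python
# Sketch: refuse to compare results across breaking scenario versions.
# Assumes each run record carries a hypothetical scenario_version field
# that authors bump on any scoring-affecting change.
def comparable(run_a: dict, run_b: dict) -> bool:
    """Results are only directly comparable within one scenario version."""
    return run_a["scenario_version"] == run_b["scenario_version"]

baseline = {"run_id": "r1", "scenario_version": "2.0.0", "score": 0.84}
candidate = {"run_id": "r2", "scenario_version": "2.1.0", "score": 0.91}

if not comparable(baseline, candidate):
    print("scenario changed between runs; scores are not directly comparable")
```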