Calibrating and Maintaining Scenarios

Review results, adjust thresholds, and version changes to keep scenarios reliable over time.

A scenario is not finished when it merges. Reviewing results across runs can reveal problems with prompts, assertions, or seed data that were not obvious during authoring.

Reviewing results

After a scenario has been run across multiple agents and models, look for patterns in the results (a small analysis sketch follows this list):

  • Failure clusters: if every agent fails the same assertion, the assertion may be too strict or the prompt may be missing critical context.
  • Universal passes: if every agent passes every gate with both personas, the assertions may not be testing enough. Review whether the checks actually exercise the behavior the scenario is meant to verify.
  • Flaky results: individual runs will vary because LLM output is non-deterministic; that is expected. But if results for the same agent/model combination are inconsistent across many runs, the scenario may have avoidable non-determinism of its own. Check seed data, service startup order, and assertion timing.
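
If run results are exported as structured records, a small script can surface these patterns across many runs. The sketch below is a minimal example, assuming each record is a dict with agent, model, assertion, and passed fields; those field names are placeholders, so adapt them to whatever your results export actually contains.

```python
from collections import defaultdict

def summarize(runs):
    """Group pass/fail results to spot failure clusters and flaky combinations.

    `runs` is assumed to be an iterable of dicts like:
      {"agent": "agent-a", "model": "model-x",
       "assertion": "latency_under_200ms", "passed": True}
    """
    by_assertion = defaultdict(list)   # assertion -> [passed, ...]
    by_combo = defaultdict(list)       # (agent, model, assertion) -> [passed, ...]

    for run in runs:
        by_assertion[run["assertion"]].append(run["passed"])
        by_combo[(run["agent"], run["model"], run["assertion"])].append(run["passed"])

    # Failure clusters: assertions that fail for (nearly) every agent.
    for assertion, results in by_assertion.items():
        fail_rate = 1 - sum(results) / len(results)
        if fail_rate > 0.9:
            print(f"possible failure cluster: {assertion} fails in {fail_rate:.0%} of runs")

    # Flaky combinations: the same agent/model/assertion with mixed pass/fail
    # across several runs.
    for (agent, model, assertion), results in by_combo.items():
        if len(results) >= 3 and 0 < sum(results) < len(results):
            print(f"inconsistent: {agent}/{model} on {assertion}: {results}")
```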

Adjusting thresholds

Performance and production gate assertions often include numeric thresholds (query latency, row counts, timing budgets). When adjusting these:

  • Document the rationale for the change. "Relaxed from 100ms to 200ms because baseline hardware varies" is traceable. A silent threshold change is not.
  • Test the new threshold against existing results before merging (see the sketch after this list). A threshold change that retroactively flips pass/fail for published runs affects leaderboard interpretation.
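
One way to check the second point is to replay a candidate threshold against the measurements recorded in published runs before merging. The sketch below assumes you can export raw measured values (here, query latencies in milliseconds) as (run_id, value) pairs; the names are illustrative only.

```python
def preview_threshold_change(measurements, old_threshold_ms, new_threshold_ms):
    """Report which recorded runs would flip pass/fail under a new latency threshold.

    `measurements` is assumed to be a list of (run_id, latency_ms) pairs
    exported from published results.
    """
    flipped = []
    for run_id, latency_ms in measurements:
        old_pass = latency_ms <= old_threshold_ms
        new_pass = latency_ms <= new_threshold_ms
        if old_pass != new_pass:
            flipped.append((run_id, latency_ms, old_pass, new_pass))

    print(f"{len(flipped)} of {len(measurements)} runs would change outcome")
    for run_id, latency_ms, old_pass, new_pass in flipped:
        print(f"  {run_id}: {latency_ms}ms  "
              f"{'pass' if old_pass else 'fail'} -> {'pass' if new_pass else 'fail'}")
    return flipped

# Example: relaxing a query-latency gate from 100ms to 200ms.
preview_threshold_change([("run-1", 95), ("run-2", 150), ("run-3", 240)], 100, 200)
```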

Breaking changes

Any change to a scenario that can alter its scoring output is a breaking change:

  • Modified assertions (new checks, changed thresholds, removed checks)
  • Changed seed data or init scripts
  • Updated prompts

When making breaking changes, note them clearly in the PR description. Results from before and after a breaking change are not directly comparable.
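
If your results store keeps one record per run, tagging each record with the scenario revision it ran against (for example, the scenario's git SHA) makes that separation easy to maintain. The sketch below is a minimal example assuming a hypothetical scenario_revision field on each record; it reports each revision separately instead of averaging across the break.

```python
from collections import defaultdict

def split_by_revision(runs):
    """Partition run records by the scenario revision they were produced against.

    Assumes each run record carries a `scenario_revision` field (for example,
    the git SHA of the scenario at run time). Results from different revisions
    are reported separately rather than mixed into one aggregate.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[run["scenario_revision"]].append(run)

    for revision, revision_runs in sorted(groups.items()):
        passed = sum(r["passed"] for r in revision_runs)
        print(f"{revision}: {passed}/{len(revision_runs)} passing runs")
    return groups
```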