Reliability
Evaluate an agent's ability to design data systems that remain correct and available under failure.
What this competency is
Engineering systems that continue to meet correctness and availability expectations despite component failures, network issues, and operational mistakes.
Why it matters
Data platforms operate in failure-prone environments. Reliability design determines whether incidents become contained events or broad outages.
What to evaluate in agents
- Identification of failure modes and blast radius.
- Redundancy, retry, timeout, and degradation strategies.
- Recovery objectives (recovery time objective, RTO, and recovery point objective, RPO) and incident response design.
- Controls for human error and safe operational changes.
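As one illustration of the recovery objectives listed above, the sketch below checks the outcome of a recovery drill against RTO and RPO targets. It is a minimal example under stated assumptions, not a prescribed method: the names RecoveryObjectives, DrillResult, and evaluate_drill are hypothetical, and data loss is assumed to be measured as the gap between the failure and the newest write that survived the restore.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for recovery targets."""
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


@dataclass(frozen=True)
class DrillResult:
    """Hypothetical record of what a recovery drill observed."""
    failure_at: datetime            # when the failure was injected
    restored_at: datetime           # when service was confirmed healthy again
    last_recovered_write: datetime  # newest write present after the restore


def evaluate_drill(objectives: RecoveryObjectives, drill: DrillResult) -> Dict[str, bool]:
    """Compare observed downtime and data loss against the targets."""
    downtime = drill.restored_at - drill.failure_at
    data_loss_window = drill.failure_at - drill.last_recovered_write
    return {
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    targets = RecoveryObjectives(rto=timedelta(minutes=30), rpo=timedelta(minutes=5))
    drill = DrillResult(
        failure_at=t0,
        restored_at=t0 + timedelta(minutes=20),
        last_recovered_write=t0 - timedelta(minutes=8),
    )
    print(evaluate_drill(targets, drill))
```

In this example the service is restored within the 30-minute RTO, but the newest surviving write is 8 minutes older than the failure, so the 5-minute RPO is missed. A design that states its targets in this form can be checked against them.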
Strong signals
- Enumerates realistic faults and concrete mitigation paths.
- Balances availability and correctness requirements explicitly.
- Defines recovery expectations and validates them with drills.
- Includes safeguards around destructive operations.
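A safeguard around a destructive operation can be as simple as a dry-run default plus an explicit confirmation. The sketch below is one possible shape, not a reference implementation: drop_partitions, DropResult, ConfirmationError, and the delete callback are hypothetical names, and requiring the caller to repeat the table name is an assumed convention.

```python
from dataclasses import dataclass
from typing import Callable, List


class ConfirmationError(Exception):
    """Raised when a destructive call lacks a matching confirmation."""


@dataclass(frozen=True)
class DropResult:
    targets: List[str]  # fully qualified partitions in the plan
    executed: bool      # False for a dry run, True if deletes were issued


def drop_partitions(
    table: str,
    partitions: List[str],
    *,
    dry_run: bool = True,
    confirm: str = "",
    delete: Callable[[str], None] = lambda target: None,
) -> DropResult:
    """Plan a partition drop; delete only when explicitly confirmed.

    `delete` is a hypothetical hook standing in for the real storage call.
    """
    targets = [f"{table}/{p}" for p in partitions]
    if dry_run:
        # Default path: report what would be removed and touch nothing.
        return DropResult(targets=targets, executed=False)
    if confirm != table:
        # The operator must repeat the table name to proceed.
        raise ConfirmationError(f"confirm must equal the table name {table!r}")
    for target in targets:
        delete(target)
    return DropResult(targets=targets, executed=True)


if __name__ == "__main__":
    print(drop_partitions("events", ["2024-01-01"]))
    removed: List[str] = []
    print(drop_partitions("events", ["2024-01-01"], dry_run=False,
                          confirm="events", delete=removed.append))
```

Making the safe path the default means an operator has to take two deliberate steps (disable the dry run and repeat the table name) before anything is deleted.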
Weak signals
- Assumes cloud-managed services eliminate reliability concerns.
- Uses retries without bounds or idempotency guarantees.
- Omits recovery targets and operational ownership.
- Ignores the risk of operator-induced failures.
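To make the contrast with unbounded, non-idempotent retries concrete, the sketch below caps the number of attempts and keys every write so that a replay is a no-op. It is an illustration under assumptions rather than a recommended library: TransientError, FlakySink, and write_with_retry are hypothetical names, the sink is an in-memory stand-in, and the first acknowledgement is deliberately lost to simulate an ambiguous timeout.

```python
import time
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for faults that are safe to retry."""


class FlakySink:
    """In-memory stand-in: the first write lands but its acknowledgement is lost."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}
        self._drop_next_ack = True

    def write(self, idempotency_key: str, row: dict) -> None:
        # Keyed insert: replaying the same key leaves the stored row unchanged.
        self.rows.setdefault(idempotency_key, row)
        if self._drop_next_ack:
            self._drop_next_ack = False
            raise TransientError("acknowledgement lost")


def write_with_retry(
    sink: FlakySink,
    idempotency_key: str,
    row: dict,
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write with a bounded number of attempts; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            sink.write(idempotency_key, row)
            return attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    sink = FlakySink()
    attempts = write_with_retry(sink, "order-42", {"amount": 10}, sleep=lambda s: None)
    print(attempts, len(sink.rows))
```

The retry after the lost acknowledgement sends the same key again, so the sink still holds a single row. Without the key, the same retry would have produced a duplicate.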
Example evaluation prompts
- "Design a fault-tolerant ingestion pipeline with strict correctness guarantees."
- "Define reliability controls for a multi-step pipeline during regional service degradation."