Data Ingestion
Evaluate an agent's ability to ingest source data reliably and correctly.
What this competency is
Designing robust ingestion from source systems into data platforms, including batch, streaming, change data capture (CDC), and API-based integration.
Why it matters
Ingestion is where source reality enters the platform. Errors here propagate downstream and compromise trust across all analytics and ML workflows.
What to evaluate in agents
- Selection of ingestion pattern based on latency, volume, and source constraints.
- Handling of idempotency, deduplication, and late-arriving data.
- Approach to backfills, replay, and incremental loads.
- Clear treatment of source contracts and schema drift.
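Several of the behaviors above (idempotency, deduplication, incremental loads) can be illustrated with a minimal sketch. This is a hypothetical example, not a reference implementation: `ingest_increment`, the integer `updated_at` watermark, and the dict-backed target are all simplifying assumptions standing in for a real source query and sink.

```python
def ingest_increment(source_rows, watermark, target):
    """Incremental load sketch: process only rows newer than the
    watermark, and upsert by key so replays and duplicate deliveries
    are harmless (idempotent writes). Returns the advanced watermark,
    which the caller would persist as a checkpoint for the next run."""
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] <= watermark:
            continue  # already covered by a previous run; skip on replay
        # Upsert: last write per key wins, so re-delivery deduplicates itself.
        target[row["id"]] = row
        if row["updated_at"] > new_watermark:
            new_watermark = row["updated_at"]
    return new_watermark
```

Because the write is a keyed upsert and the watermark filters already-seen rows, running the same batch twice leaves the target unchanged, which is the property an evaluator should look for.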
Strong signals
- Describes checkpoints or watermarks for incremental processing.
- Includes retry, dead-letter, and replay strategy.
- Distinguishes full reload from incremental and CDC paths.
- Addresses source rate limits and operational constraints.
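A "retry, dead-letter, and replay strategy" can be as small as bounded retries plus a dead-letter collection that preserves failed records for later reprocessing. The sketch below is illustrative; `process_with_retry`, its `handler` callback, and the list-based dead-letter queue are assumptions, not a real framework API.

```python
import time

def process_with_retry(records, handler, max_attempts=3, backoff_s=0.0):
    """Retry each record a bounded number of times; route permanent
    failures to a dead-letter list instead of dropping them or
    blocking the whole batch. Dead-lettered records stay replayable."""
    succeeded, dead_letter = [], []
    for rec in records:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(rec)
                succeeded.append(rec)
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.append(rec)  # exhausted retries: park for replay
                else:
                    time.sleep(backoff_s * attempt)  # linear backoff between attempts
    return succeeded, dead_letter
```

The design choice worth probing in an agent's answer is that a poison record lands in the dead-letter path rather than stalling the pipeline, and that the dead-letter output is structured enough to replay once the defect is fixed.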
Weak signals
- Assumes exactly-once semantics without a design that supports them (e.g., idempotent writes or transactional sinks).
- Omits replay and backfill behavior.
- Ignores source-side changes and contract breaks.
- Uses one ingestion pattern for all workloads without trade-off analysis.
Example evaluation prompts
- "Design CDC ingestion from Postgres into a lakehouse with hourly SLA."
- "Propose a resilient API ingestion pattern for rate-limited vendor data."
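For the second prompt, one shape of answer an evaluator might expect is cursor-based pagination with exponential backoff on rate-limit responses. The sketch below assumes a hypothetical `fetch_page(cursor)` client that returns `(rows, next_cursor)` and raises `RateLimited` on a 429; none of these names come from a real vendor SDK.

```python
import time

class RateLimited(Exception):
    """Stand-in for the client error raised when the vendor returns HTTP 429."""

def fetch_all(fetch_page, max_retries=3, backoff_base=0.0):
    """Drain a paginated, rate-limited API. On RateLimited, back off
    exponentially and retry the same page; give up after max_retries
    so a hard outage surfaces instead of looping forever."""
    rows, cursor = [], None
    while True:
        for attempt in range(max_retries + 1):
            try:
                page, cursor = fetch_page(cursor)
                break
            except RateLimited:
                if attempt == max_retries:
                    raise  # persistent throttling: fail loudly for the operator
                time.sleep(backoff_base * (2 ** attempt))  # exponential backoff
        rows.extend(page)
        if cursor is None:
            return rows  # final page reached
```

Strong answers also keep the cursor as a checkpoint (so an interrupted run resumes rather than restarting) and respect any `Retry-After` hint the vendor provides instead of a fixed backoff schedule.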