Designing Retry-Aware Infrastructure
Notes on queues, idempotency keys, and what it actually means to recover gracefully from failure.
Retries are one of those things that seem simple until they cause an outage.
Why Retries Are Dangerous
Naive retry logic turns a brief hiccup into a thundering herd. If every client retries at the same interval, you get a synchronized flood of requests exactly when the system is most stressed.
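One common way to break that synchronization is exponential backoff with random jitter, so that each client waits a different amount of time before retrying. The sketch below is a minimal illustration, not a drop-in client: `jittered_delay`, `call_with_retries`, and the default `base` and `cap` values are hypothetical names and numbers chosen for the example.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Return a randomized delay in seconds for the given retry attempt."""
    # Exponential ceiling: base, 2*base, 4*base, ... limited by cap.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly in [0, ceiling] so clients spread out.
    return rng.uniform(0, ceiling)


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(), retrying on failure with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # A real client would catch only errors known to be retryable.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(jittered_delay(attempt, base, cap))
```

Because each delay is drawn at random, two clients that fail at the same moment are unlikely to retry at the same moment, which spreads the load instead of stacking it.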
Idempotency Keys
The foundation of safe retries is idempotency: running the same operation twice has the same effect as running it once. In practice this means:
- Generating a unique key per logical operation at the call site
- Storing that key server-side with the result
- On retry, returning the cached result instead of re-executing
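The three steps above can be sketched with an in-memory store. Everything here is an assumption made for illustration: `IdempotentServer`, its `charge` method, and the shape of the result are hypothetical, and a real service would keep the keys in a durable store and make the check-and-insert atomic so that concurrent retries cannot both execute.

```python
import uuid


class IdempotentServer:
    """Toy server that remembers the result for each idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> cached result
        self.executions = 0  # how many times the operation really ran

    def charge(self, idempotency_key, amount):
        # On retry, return the cached result instead of re-executing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # First time we see this key: run the operation and store the result.
        self.executions += 1
        result = {"charge_id": self.executions, "amount": amount}
        self._results[idempotency_key] = result
        return result


server = IdempotentServer()
# Generated once per logical operation at the call site, reused on every retry.
key = str(uuid.uuid4())
first = server.charge(key, 500)
retry = server.charge(key, 500)  # e.g. the client timed out and tried again
```

Here `first` and `retry` are the same result and `server.executions` stays at 1, so the retry is safe even though the client could not tell whether the first call succeeded.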
What I Used at UCSF × Stanford
In the healthcare streaming pipeline, we used Kafka consumer groups with explicit offset commits. A message was only marked "done" after it had been fully processed and written to Cassandra. If a consumer crashed mid-processing, the message was re-read from the last committed offset after a restart or rebalance.
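Combined with idempotent writes (using a composite key of patient ID + event timestamp), this gave us effectively exactly-once processing at the application level: delivery itself was at-least-once, but a replayed message overwrote the same row instead of creating a duplicate.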
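To make the interaction between commit-after-write and idempotent writes concrete, here is a small simulation that uses a plain list in place of a Kafka partition and a dict in place of a Cassandra table. It does not use either system's client API; `consume`, `crash_at`, `deliveries`, and the event fields are hypothetical names invented for this sketch.

```python
deliveries = []  # every offset handed to the processor, including replays


def consume(log, committed, store, crash_at=None):
    """Process events from the committed offset; return the new committed offset."""
    while committed < len(log):
        event = log[committed]
        deliveries.append(committed)
        # Idempotent write: the composite key means a replayed event
        # overwrites its own row rather than adding a second one.
        key = (event["patient_id"], event["timestamp"])
        store[key] = event["value"]
        if committed == crash_at:
            # Simulated crash: the write happened, the commit did not.
            return committed
        committed += 1  # "commit" the offset only after the write succeeded
    return committed


log = [
    {"patient_id": "p1", "timestamp": 1000, "value": "admitted"},
    {"patient_id": "p1", "timestamp": 1060, "value": "vitals recorded"},
    {"patient_id": "p2", "timestamp": 1000, "value": "admitted"},
]
store = {}
offset = consume(log, 0, store, crash_at=1)  # crashes after writing event 1
offset = consume(log, offset, store)         # restart re-reads event 1
```

After the restart, `deliveries` shows offset 1 processed twice, yet `store` holds exactly three rows. The duplicate delivery is absorbed by the key, which is the whole point of pairing the two techniques.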
— Amisha
Filed under: Infrastructure