Designing Retry-Aware Infrastructure

Notes on queues, idempotency keys, and what it actually means to recover gracefully from failure.

Retries are one of those things that seem simple until they cause an outage.

Why Retries Are Dangerous

Naive retry logic turns a brief hiccup into a thundering herd. If every client retries at the same interval, you get a synchronized flood of requests exactly when the system is most stressed.

Idempotency Keys

The foundation of safe retries is idempotency: the same operation, run twice, produces the same result. In practice this means:

Generating a unique key per logical operation at the call site
Storing that key server-side with the result
On retry, returning the cached result instead of re-executing

What I Used at UCSF × Stanford

In the healthcare streaming pipeline, we used Kafka consumer groups with explicit offset commits. A message is only marked "done" after it's been fully processed and written to Cassandra. If the consumer crashes mid-processing, it re-reads the message from the last committed offset.

Combined with idempotent writes (using a composite key of patient ID + event timestamp), this gives us exactly-once semantics at the application level.

— Amisha

Filed under: Infrastructure

Designing Retry-Aware Infrastructure

01 Why Retries Are Dangerous

02 Idempotency Keys

03 What I Used at UCSF × Stanford

Why Retries Are Dangerous

Idempotency Keys

What I Used at UCSF × Stanford