All writing
InfrastructureApr 20266 min read

Designing Retry-Aware Infrastructure

Notes on queues, idempotency keys, and what it actually means to recover gracefully from failure.

Retries are one of those things that seem simple until they cause an outage.

Why Retries Are Dangerous

Naive retry logic turns a brief hiccup into a thundering herd. If every client retries at the same interval, you get a synchronized flood of requests exactly when the system is most stressed.

Idempotency Keys

The foundation of safe retries is idempotency: the same operation, run twice, produces the same result. In practice this means:

  • Generating a unique key per logical operation at the call site
  • Storing that key server-side with the result
  • On retry, returning the cached result instead of re-executing

What I Used at UCSF × Stanford

In the healthcare streaming pipeline, we used Kafka consumer groups with explicit offset commits. A message is only marked "done" after it's been fully processed and written to Cassandra. If the consumer crashes mid-processing, it re-reads the message from the last committed offset.

Combined with idempotent writes (using a composite key of patient ID + event timestamp), this gives us exactly-once semantics at the application level.

— Amisha

Filed under: Infrastructure

Next in this series
Your Agent Has Amnesia. Here's the Dict That Fakes Memory.
Week 2