advancedreliabilitydistributed-systemsincident-response~14 min4 rounds

A 30-Second Blip Became a 2-Hour Outage. More Retries?

A brief downstream blip cascaded into a full-platform meltdown because every layer retried. A teammate wants to raise retry counts. Defend the real fix.

the decision you defend

A downstream payment service blipped for 30 seconds. Retries at the client, the API gateway, and two internal services piled on, and the whole platform stayed down for 2 hours after the payment service had already recovered. In the postmortem a teammate proposes raising retry counts from 3 to 5 so requests eventually succeed next time. What do you do, and why?

the situation

Last Tuesday a downstream payment service had a 30-second blip - a node restarted and health checks briefly failed. The platform did not recover in 30 seconds. Latency climbed, queues filled, and the entire platform was effectively down for 2 hours, long after the payment service itself was healthy again.

context

The postmortem timeline shows retry logic at four places: the mobile client retries 3 times, the API gateway retries 3 times, and two internal services each retry 3 times with no backoff. During the incident the payment service saw more than 20x its normal request rate, almost all of it retries, and could not climb out. In the postmortem review a teammate proposes: "The requests failed because we gave up too early. Raise retry counts from 3 to 5 across the board so requests eventually succeed." You own the reliability recommendation.

How this challenge works

Take a position on the decision above and defend it. A senior-engineer AI will push back over up to 4 rounds. When you are done, you are scored against a verified rubric so you can see exactly what a complete answer covers - these are learning prompts, not gotchas.