advanceddatabasesreliabilitydisaster-recovery~14 min4 rounds

Failover Lost 5 Minutes of Writes. Promise Zero Data Loss?

An async-replica failover dropped five minutes of committed writes. Leadership demands zero data loss with zero performance impact. Defend the honest trade-off.

the decision you defend

Your primary database died and you failed over to an async replica. Five minutes of committed writes - orders customers got confirmations for - are gone. Leadership says: guarantee this never happens again, with zero data loss and zero performance impact. What do you tell them, and what do you actually build?

the situation

At 02:10 the primary database's host died. Failover promoted the async replica and the app came back at 02:24. Then support tickets started: customers have order-confirmation emails for orders that do not exist. The replica was about five minutes behind at the moment of failure, and every write in that window - roughly 1,900 transactions, including around 140 paid orders - is gone.

context

Replication has always been asynchronous; it was set up years ago and nobody revisited it. The morning after, the CTO puts it simply: "Committed means committed. I want a guarantee this can never happen again - zero data loss - and it obviously cannot slow the product down, we compete on speed." You are asked to come back tomorrow with the plan.

How this challenge works

Take a position on the decision above and defend it. A senior-engineer AI will push back over up to 4 rounds. When you are done, you are scored against a verified rubric so you can see exactly what a complete answer covers - these are learning prompts, not gotchas.