intermediate | caching | performance | reliability | ~12 min | 4 rounds

Every Deploy Hammers the Database for 10 Minutes. Bigger DB?

Each deploy flushes the cache and the database melts for 10 minutes while it refills. A teammate wants to double the database size. Defend fixing the stampede instead of paying for peak.

the decision you defend

Every deploy flushes your Redis cache, and for the next 10 minutes the database runs at 90-95% CPU serving a flood of identical queries while the cache refills. Users see timeouts. A teammate proposes doubling the database instance size so it can absorb the refill. Do you upsize the database, or fix something else? Defend your call.

the situation

Your team deploys a mid-sized e-commerce API three or four times a day. The deploy script flushes the entire Redis cache "to avoid stale data", and for roughly 10 minutes afterward the primary Postgres instance sits at 90-95% CPU while the cache refills. During that window, p99 latency climbs from 80 ms to 6 seconds and a few percent of requests time out. Support has started asking why the site "breaks every afternoon".

context

The database is an r6g.xlarge that runs at 25-30% CPU outside the refill window. Cache hit rate is normally around 97%; product and pricing queries are the expensive ones being duplicated during refill. There is no request coalescing anywhere in the stack, TTLs are a uniform 15 minutes, and the deploy pipeline runs a plain FLUSHALL. A teammate has already priced the next instance size up and posted in the channel: "double the DB and this whole problem disappears, it is only money."
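For readers who have not met the term, "request coalescing" (sometimes called single-flight) means that when many requests miss the cache on the same key at once, only one of them runs the database query and the rest wait for its result. The sketch below is a minimal in-process illustration in Python, not the API's actual code: the `SingleFlight` class, the `slow_product_query` stand-in, and the key name `product:demo-123` are all hypothetical. It only coalesces within one process; coalescing across several API instances would need something shared, which is outside this sketch.

```python
import threading
import time
from concurrent.futures import Future


class SingleFlight:
    """Coalesce concurrent loads of the same key into a single call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future shared by all current callers

    def do(self, key, loader):
        # The first caller for a key becomes the leader and runs the loader.
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if leader:
            try:
                fut.set_result(loader())
            except BaseException as exc:
                # Hand the failure to every waiter instead of leaving them blocked.
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
        # Leader and followers all read the same result (or the same exception).
        return fut.result()


def run_demo(n_requests=50):
    """Fire n_requests concurrent cache misses for one key; count DB calls."""
    sf = SingleFlight()
    db_calls = []                      # one entry per simulated query
    barrier = threading.Barrier(n_requests)
    results = [None] * n_requests

    def slow_product_query():
        # Hypothetical stand-in for an expensive product/pricing query.
        db_calls.append(1)
        time.sleep(0.3)
        return {"sku": "demo-123", "price_cents": 1999}

    def request(i):
        barrier.wait()                 # release all "requests" together
        results[i] = sf.do("product:demo-123", slow_product_query)

    threads = [threading.Thread(target=request, args=(i,)) for i in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, len(db_calls)


if __name__ == "__main__":
    results, calls = run_demo()
    print(f"{len(results)} requests served by {calls} database call(s)")
```

Without the `SingleFlight` wrapper, each of the simulated requests would run its own query, which is the duplicated-query pattern described above. Whether coalescing alone is enough for this system, or whether it should be combined with changes to the flush or the TTLs, is part of what you are asked to argue.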

How this challenge works

Take a position on the decision above and defend it. A senior-engineer AI will push back over up to 4 rounds. When you are done, you are scored against a verified rubric so you can see exactly what a complete answer covers. These are learning prompts, not gotchas.