3 AM Outage: Roll Back or Fix Forward?
A deploy 20 minutes ago correlates with a spike in 500s. You are on call. Defend whether to roll back or fix forward.
the decision you defend
Checkout error rate jumped from 0.1% to 9% right after release v412. You can roll back to v411 in about 2 minutes, or push a fix forward with an unknown ETA. What do you do, and why?
the situation
It is 3:07 AM. PagerDuty wakes you. The error-rate dashboard for the checkout service shows 500s climbing from a 0.1% baseline to 9% over the last six minutes.
context
Release v412 was deployed at 2:47 AM, about 20 minutes ago. The deploy pipeline is green. You have a one-command rollback to v411, which ran clean for nine days before this release. You do not yet know the root cause. A teammate in Slack says "I think I can patch it, give me a sec."
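Before taking a position, it can help to lay the stated numbers side by side. The sketch below only restates figures from the scenario. The calendar date is a hypothetical placeholder, and it assumes the six-minute climb is measured back from the 3:07 AM page. All variable names are illustrative.

```python
from datetime import datetime

# Hypothetical placeholder date; only the clock times come from the scenario.
DAY = "2024-01-01"


def at(clock: str) -> datetime:
    """Parse a 24-hour HH:MM clock time on the placeholder day."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M")


deploy_time = at("02:47")   # v412 deployed
page_time = at("03:07")     # PagerDuty alert
spike_window_min = 6        # "over the last six minutes"

# Error rates in basis points (1 bp = 0.01%) to keep the arithmetic exact.
baseline_bp = 10            # 0.1%
current_bp = 900            # 9%

minutes_since_deploy = (page_time - deploy_time).total_seconds() / 60
# Assumption: the six-minute window marks where the climb began.
minutes_deploy_to_spike = minutes_since_deploy - spike_window_min
error_multiple = current_bp / baseline_bp
failed_per_1000 = current_bp * 1000 // 10_000

print(f"Minutes since deploy:        {minutes_since_deploy:.0f}")
print(f"Deploy to start of climb:    {minutes_deploy_to_spike:.0f} min")
print(f"Error rate vs. baseline:     {error_multiple:.0f}x")
print(f"Failed checkouts per 1,000:  {failed_per_1000}")
```

Under these assumptions, the page arrives 20 minutes after the deploy, the error rate is 90 times baseline, and about 90 of every 1,000 checkout requests are failing. If the six-minute window really marks the onset, the climb began roughly 14 minutes after the deploy, a gap worth addressing in whichever position you defend.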
How this challenge works
Take a position on the decision above and defend it. A senior-engineer AI will push back over up to 5 rounds. When you are done, you are scored against a verified rubric so you can see exactly what a complete answer covers. These are learning prompts, not gotchas.