advanced · observability · performance · incident-response · ~15 min · 5 rounds

p99 Latency Spiked Across a Microservice Chain

p99 latency jumped across a request that touches six services. A teammate wants to scale up the slowest-looking one. Defend how you find the real culprit.

the decision you defend

Your p99 latency jumped on a user request that flows through six microservices. Average latency looks fine. A teammate points at the service with the highest CPU and says to scale that one up. How do you actually find the culprit?

the situation

Users report the app feeling slow. Your dashboards show p99 latency has jumped sharply on a key request that passes through about six microservices, while average latency looks normal.
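To see how the average can stay near normal while p99 jumps, here is a minimal Python sketch on made-up numbers. The sample latencies are hypothetical, not data from this incident, and `p_nearest_rank` is an illustrative helper using the nearest-rank percentile definition (monitoring systems may interpolate differently).

```python
import statistics


def p_nearest_rank(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Smallest 1-based rank r such that at least p% of samples are <= ordered[r - 1];
    # integer math avoids float rounding in ceil(p / 100 * n).
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(samples):
    return {"mean_ms": statistics.fmean(samples), "p99_ms": p_nearest_rank(samples, 99)}


# Hypothetical baseline: 1,000 requests that each take 100 ms.
baseline = [100] * 1000
# Hypothetical incident: same traffic, but 11 requests (1.1%) now take 1,500 ms.
incident = [100] * 989 + [1500] * 11

before = summarize(baseline)
after = summarize(incident)
print(before)  # {'mean_ms': 100.0, 'p99_ms': 100}
print(after)   # {'mean_ms': 115.4, 'p99_ms': 1500}
```

In this toy case the mean rises about 15% while p99 rises 15x, because only about 1% of requests got slower. An averages-only dashboard can therefore look healthy during a tail regression.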

context

You have per-service CPU and memory metrics and request logs, and distributed tracing is available. One service in the chain is running noticeably hotter on CPU than the others. A teammate wants to scale that hottest service up and see if it helps.
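One way to use the available tracing data is to compare where time goes in the slowest traces against the rest, instead of ranking services by CPU. The Python sketch below assumes each trace has already been reduced to a mapping of service name to *self time* (span duration minus time spent in child calls). The service names, timings, and the `attribute_tail` helper are all hypothetical illustrations, not the output of any particular tracing backend.

```python
from statistics import fmean

# Hypothetical per-trace self times in ms (time in each service's own spans,
# excluding downstream calls). "pricing" is the CPU-hot service: busy but stable.
NORMAL = {"gateway": 5, "auth": 8, "catalog": 12, "pricing": 40, "inventory": 10, "orders": 15}
# Assumed failure mode: one hop occasionally stalls (e.g. waiting on a lock or a DB).
SLOW = {**NORMAL, "inventory": 900}
traces = [NORMAL] * 98 + [SLOW] * 2


def p99_nearest_rank(values):
    ordered = sorted(values)
    return ordered[(99 * len(ordered) + 99) // 100 - 1]


def attribute_tail(traces):
    """Rank services by how much more self time they take in tail traces."""
    totals = [sum(t.values()) for t in traces]
    cutoff = p99_nearest_rank(totals)
    tail = [t for t, total in zip(traces, totals) if total >= cutoff]
    rest = [t for t, total in zip(traces, totals) if total < cutoff]
    deltas = {
        svc: fmean([t[svc] for t in tail]) - fmean([t[svc] for t in rest])
        for svc in traces[0]
    }
    # Largest increase first; these are the candidates to investigate.
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


ranking = attribute_tail(traces)
print(ranking[0])               # ('inventory', 890.0)
print(dict(ranking)["pricing"])  # 0.0
```

In this made-up data the CPU-hot service contributes the same time to fast and slow requests, so scaling it would not move p99; the service whose contribution grows in the tail is the one to examine. Real traces are noisier, so a comparison like this points to where to look next rather than proving the root cause.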

How this challenge works

Take a position on the decision above and defend it. A senior-engineer AI will push back over up to 5 rounds. When you are done, you are scored against a verified rubric so you can see exactly what a complete answer covers; these are learning prompts, not gotchas.