intermediate · kubernetes · reliability · observability · ~12 min · 5 rounds

Pods Keep OOMKilling: Scale Up or Fix the Leak?

A service OOMKills every few hours. Adding replicas or raising memory limits buys time. Defend whether to scale around it or fix the root cause.

the decision you defend

A payments worker pod gets OOMKilled every 3 to 4 hours under normal load. You can raise the memory limit and add replicas in minutes, or you can investigate a suspected memory leak, which will take an unknown amount of effort. Production is currently degraded but not down. What do you do, and why?

the situation

A Kubernetes payments worker (payments-worker) is being OOMKilled roughly every three to four hours under normal traffic. Each kill drops a few in-flight jobs, which retry, so customers see occasional delays but not hard failures.

context

The pod has a 512Mi memory limit. Memory climbs steadily from startup until the kill, then resets on restart, a classic sawtooth. You have a heap profiler available but have not captured a profile yet. A colleague suggests "just bump it to 2Gi and add two more replicas, we can look at it next sprint."
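To get a feel for what the colleague's proposal would buy, here is a rough back-of-the-envelope sketch in Python. It assumes memory grows linearly from a startup baseline up to the limit. The 128 MiB baseline and the 3.5-hour midpoint are illustrative assumptions, not measurements from the scenario, and the function names are hypothetical.

```python
def leak_rate_mib_per_hour(limit_mib, baseline_mib, hours_to_kill):
    """Average growth rate implied by a linear climb from baseline to the limit."""
    return (limit_mib - baseline_mib) / hours_to_kill


def hours_until_oom(limit_mib, baseline_mib, rate_mib_per_hour):
    """Time for a linear leak to reach the limit, starting from baseline."""
    return (limit_mib - baseline_mib) / rate_mib_per_hour


# Assumed inputs: the 128 MiB baseline is a guess, and 3.5 h is the
# midpoint of the "three to four hours" given in the scenario.
BASELINE_MIB = 128
rate = leak_rate_mib_per_hour(512, BASELINE_MIB, 3.5)
projected = hours_until_oom(2048, BASELINE_MIB, rate)

print(f"implied leak rate: {rate:.0f} MiB/h")
print(f"projected time to OOM at 2Gi: {projected:.1f} h")
```

Under these assumptions, raising the limit to 2Gi stretches the sawtooth rather than removing it: the kills become less frequent but still happen. Replica count does not appear in this model at all. If the leak scales with per-pod request volume rather than with time, extra replicas could also slow it, which is one more reason to capture a heap profile before deciding.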

How this challenge works

Take a position on the decision above and defend it. A senior-engineer AI will push back over up to 5 rounds. When you are done, you are scored against a verified rubric so you can see exactly what a complete answer covers. These are learning prompts, not gotchas.