Debugging Kubernetes: Diagnose Pods and Fix Failures

A real Kubernetes incident playbook: read STATUS, describe the Events, and pull the logs to diagnose Pending, ImagePullBackOff, CrashLoopBackOff, OOMKilled, and stuck pods fast.

The fundamentals guide closes with the debugging method in three lines: read the STATUS, describe for the events, then logs for the app's own story. That is the right instinct, and it resolves most incidents. This guide goes deeper. It is the playbook I actually run when a pod is stuck at 3 a.m. - what each failure state really means, the exact command that confirms the root cause instead of guessing, and the tools you reach for when the basic three are not enough. The thread running through all of it is one discipline: read reality before you change it.

The method, and why delete-and-recreate is a trap

The whole debugging loop is three commands, in this order, every time:

kubectl get pods                 # what STATUS is it stuck in?
kubectl describe pod <name>      # read the Events at the bottom - they explain WHY
kubectl logs <name> --previous   # if it crashed, what did it print on the way out?

get tells you the state. describe tells you why the control plane thinks the pod is in that state - the Events section at the bottom is the single most useful thing in Kubernetes debugging, because it is the scheduler and kubelet narrating their decisions ("0/3 nodes available", "Failed to pull image", "Liveness probe failed"). logs tells you the application's side of the story, and --previous recovers the logs of the container that already died - which is exactly the one you need when a pod is crash-looping.

Now the trap. When a pod is broken, the reflex is kubectl delete pod <name> and let it come back fresh. Resist it. In almost every case the pod is managed by a Deployment, so the ReplicaSet controller immediately creates an identical replacement - same image, same config, same bug - and it lands in the same broken state. You have not fixed anything; you have destroyed the evidence. The crashed container's --previous logs are gone, the old pod's Events are no longer one describe away (they linger only as loose Event objects until they expire), and you are back to square one with less information.

This is the control loop from fundamentals biting you. Deleting a pod changes reality, and the controller's entire job is to drag reality back to the desired state stored in etcd. If the desired state is broken (a bad image tag, a missing Secret, an OOM-inducing limit), recreating the pod just re-runs the same broken reconciliation. You fix Kubernetes incidents by changing the desired state (the Deployment spec) or the actual bug (the app), not by churning pods. Diagnose first, on the pod that is currently failing, while the evidence is live.
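One way to make "diagnose first" a habit is to snapshot the evidence before touching anything. The following is a minimal sketch, not a standard tool: the function name capture_pod_evidence, its argument order, and the output file names are my own assumptions. It only reads from the cluster.

```shell
# Sketch: save a failing pod's evidence to disk before changing anything.
# Function name, arguments, and file names are illustrative assumptions.
capture_pod_evidence() (
  pod=$1
  ns=${2:-default}
  out=${3:-./evidence-$pod}
  mkdir -p "$out" || exit 1

  # Events, Last State, and exit codes as the control plane sees them.
  kubectl describe pod "$pod" -n "$ns" > "$out/describe.txt" 2>&1

  # Logs of the container that already died (an error is saved if none exists).
  kubectl logs "$pod" -n "$ns" --previous > "$out/logs-previous.txt" 2>&1

  # Logs of the current container, if it has printed anything yet.
  kubectl logs "$pod" -n "$ns" > "$out/logs-current.txt" 2>&1

  # The full object, including status and finalizers.
  kubectl get pod "$pod" -n "$ns" -o yaml > "$out/pod.yaml" 2>&1

  echo "evidence saved to $out"
)
```

Called as capture_pod_evidence web prod, it leaves four files you can still read after the pod is replaced. For multi-container pods you would likely want to add -c <container> to the logs calls.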

Pending: the pod cannot be scheduled

A Pending pod has been accepted by the API server but the scheduler has not placed it on a node. The container is not even trying to start yet - this is a placement problem, not an app problem. Go straight to the Events:

kubectl describe pod <name>
# Events:
#   Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient memory.

The message is usually explicit. Read the "N/M nodes are available" line and the reasons after it:

Insufficient cpu / memory - no node has enough unreserved capacity to cover the pod's resource requests. Note: this is about requests, not actual usage. A cluster can look 40% utilized and still refuse to schedule because requests are reserved even when idle. Confirm with kubectl describe nodes (look at "Allocated resources") or kubectl top nodes for real usage.
Node(s) had untolerated taint - the pod does not tolerate a taint on the available nodes (control-plane nodes are tainted NoSchedule by default; some node pools are tainted for GPUs or spot). Check with kubectl describe node <node> | grep Taints and add a matching toleration, or target a different pool.
Node(s) didn't match node selector / affinity - a nodeSelector or nodeAffinity rule on the pod matches no node's labels. Verify the labels exist: kubectl get nodes --show-labels. A common one is requiring disktype=ssd on a cluster where nobody labelled the nodes.
N node(s) didn't match pod anti-affinity rules - a spread rule (e.g. "no two replicas on the same node") cannot be satisfied because you asked for more replicas than nodes.

The confirming command for the resource case:

kubectl describe nodes | grep -A 6 "Allocated resources"   # what is reserved per node
kubectl get pod <name> -o jsonpath='{.spec.containers[*].resources.requests}'   # what this pod asked for

The fix is one of three: add capacity (scale the node pool, or let Cluster Autoscaler do it if configured), lower the pod's requests if they are inflated, or relax the constraint. Deleting the pod does nothing - a fresh pod is just as unschedulable.
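The mapping from FailedScheduling text to next step can be written down as a small lookup. This is a sketch under assumptions: scheduling_hint is a hypothetical helper, the substrings are the ones quoted in this guide and may differ between Kubernetes versions, and a message that lists several reasons gets only the first match.

```shell
# Sketch: map a FailedScheduling message to the likely cause.
# scheduling_hint is a hypothetical helper; match strings may vary by version.
scheduling_hint() {
  case $1 in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "capacity: compare the pod's requests with Allocated resources per node" ;;
    *"untolerated taint"*)
      echo "taint: add a matching toleration or target another node pool" ;;
    *"anti-affinity"*)
      echo "anti-affinity: the spread rule cannot be satisfied on the nodes available" ;;
    *"node affinity"*|*"node selector"*)
      echo "selector/affinity: check that the required node labels exist" ;;
    *)
      echo "unrecognized: read the full FailedScheduling message" ;;
  esac
}
```

Paste the message from the Events as the single argument, for example scheduling_hint "0/3 nodes are available: 3 Insufficient memory."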

ImagePullBackOff and ErrImagePull: the node cannot get the image

The kubelet asked the container runtime to pull the image and it failed. ErrImagePull is the first failure; ImagePullBackOff is Kubernetes backing off and retrying with growing delay. The Events name the exact reason:

kubectl describe pod <name>
# Events:
#   Warning  Failed  Failed to pull image "myapp:1.5.0": ... not found
#   Warning  Failed  Error: ErrImagePull

The causes, in rough order of how often I hit them:

Typo in the name or tag - myapp:1.5.0 when you pushed 1.5 or v1.5.0, or a registry path wrong by one segment. The error says manifest unknown or not found. Confirm the tag really exists in the registry (pull it yourself: docker pull myapp:1.5.0).
Private registry, no or wrong pull secret - the error says unauthorized, authentication required, or pull access denied. The pod needs an imagePullSecret, or the node's own identity (the node's IAM role on EKS, the node's service account on GKE) is not authorized to pull from the registry. Check the pod actually references the secret: kubectl get pod <name> -o jsonpath='{.spec.imagePullSecrets}', and that the secret exists and is the right type: kubectl get secret <secret-name> -o yaml (should be kubernetes.io/dockerconfigjson).
Rate limited - Docker Hub anonymous pull limits show as toomanyrequests. Authenticate the pulls or mirror the image.
Wrong architecture - the image has no manifest for the node's arch (an amd64-only image on arm64 nodes). The error mentions no matching manifest.

The single fastest confirmation is just reading the words after Failed to pull image in the Events - not found vs unauthorized splits the two most common causes instantly.
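For the unauthorized branch, the two pull-secret checks above can be chained. The sketch below assumes a hypothetical helper name, check_pull_secrets, and assumes the secrets live in the pod's namespace (which is where Kubernetes looks for them). It reads only; it changes nothing.

```shell
# Sketch: verify every imagePullSecret a pod references exists and has the
# expected type. check_pull_secrets is a hypothetical helper name.
check_pull_secrets() (
  pod=$1
  ns=${2:-default}
  names=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}')
  if [ -z "$names" ]; then
    echo "pod references no imagePullSecrets"
    exit 1
  fi
  status=0
  for s in $names; do
    stype=$(kubectl get secret "$s" -n "$ns" -o jsonpath='{.type}' 2>/dev/null)
    case $stype in
      kubernetes.io/dockerconfigjson) echo "ok: $s" ;;
      "") echo "missing: $s"; status=1 ;;
      *)  echo "unexpected type: $s ($stype)"; status=1 ;;
    esac
  done
  exit "$status"
)
```

A non-zero exit means at least one referenced secret is absent or not a dockerconfigjson secret. It does not prove the credentials inside are valid; for that, try the pull yourself with the same credentials.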

CrashLoopBackOff: the app is failing on boot

This is the one people fear and the one where the method pays off most. CrashLoopBackOff means the container starts, then exits or crashes, and Kubernetes restarts it with exponentially growing back-off (10s, 20s, 40s, up to 5 minutes). The pod is scheduled and pulling fine - the problem is the process itself is dying. The restart count in kubectl get pods climbs:

kubectl get pods
# NAME   READY   STATUS             RESTARTS      AGE
# web    0/1     CrashLoopBackOff   6 (30s ago)   4m

Here the essential command is --previous, because by the time you look, the current container is sitting in back-off and has logged little or nothing; the story is in the container that already died:

kubectl logs <name> --previous       # stdout/stderr of the crashed container
kubectl describe pod <name>          # Last State + exit code + probe failures

In describe, find Last State: Terminated and its Reason and Exit Code. That exit code narrows it fast:

Exit Code 1 / app-specific - the application crashed on startup. The --previous logs almost always show it: a missing or malformed env var, a config file it cannot find, a database or dependency it cannot reach on boot, a failed migration, an unhandled exception. This is the majority. Read the logs; the app usually tells you exactly what it wanted.
Exit Code 137 - the container received SIGKILL. Combined with Reason: OOMKilled this is a memory kill (see the next section). Without OOMKilled, something (often a liveness probe) is killing it.
Exit Code 127 - command not found. The container's command/args or entrypoint points at a binary that is not there. Common with a wrong image or a typo'd command.
Exit Code 0 - the process ran and exited successfully, but Kubernetes expects long-running pods to stay up, so it restarts and loops. This is usually a batch job accidentally deployed as a Deployment, or an entrypoint that does its work and returns.
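The exit-code table above is small enough to keep as a lookup. This is a sketch: exit_code_hint is a hypothetical helper, and the second argument is the Reason shown next to Last State in describe.

```shell
# Sketch: turn Last State's exit code (and optional Reason) into a next step.
# exit_code_hint is a hypothetical helper, not part of kubectl.
exit_code_hint() (
  code=$1
  reason=${2:-}
  case $code in
    0)   echo "clean exit: a one-shot process is running under a restart-always controller" ;;
    1)   echo "app error on startup: read the --previous logs" ;;
    127) echo "command not found: check command/args, entrypoint, and image" ;;
    137)
      if [ "$reason" = "OOMKilled" ]; then
        echo "SIGKILL from the OOM killer: compare memory usage with the limit"
      else
        echo "SIGKILL without OOMKilled: look for liveness probe failures in Events"
      fi ;;
    *)   echo "app-specific exit code $code: read the --previous logs" ;;
  esac
)
```

The inputs can be read straight from the pod status, for example with kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}' (index 0 assumes a single-container pod).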

There is one CrashLoopBackOff that is not the app's fault at all: a too-aggressive liveness probe restarting a healthy pod. If describe shows Events like Liveness probe failed: ... connection refused shortly after each start, and the app's own logs look fine (it logged "listening on 8080" then got killed), the probe is the culprit. Classic causes: initialDelaySeconds too short so the probe fires before a slow app finishes booting (fix with a startupProbe or a longer initial delay), or the liveness endpoint does a deep health check that also fails when a downstream dependency is slow - so Kubernetes restarts a perfectly healthy pod because the database is lagging. Keep liveness shallow ("is the process wedged?") and push dependency checks into readiness. This exact trap - app is fine, probe is killing it - is a favorite because the obvious move (dig into the app) is the wrong one.

Want to argue this one out against a senior engineer who pushes back? The CrashLoopBackOff challenge drops you into a live crash-looping service and makes you defend your diagnosis.

OOMKilled: over the memory limit

When a container exceeds its memory limit, the kernel's OOM killer terminates it. There is no graceful request: memory is not compressible, so the process is killed with SIGKILL (exit code 137). You see it in the pod's state, not just the logs:

kubectl get pods
# NAME   READY   STATUS      RESTARTS   AGE
# web    0/1     OOMKilled   3          12m         # or CrashLoopBackOff with OOMKilled inside

kubectl describe pod <name>
# Last State:     Terminated
#   Reason:       OOMKilled
#   Exit Code:    137

Last State: Terminated, Reason: OOMKilled is the confirmation. Now the real question, and the one that separates a fix from a band-aid: is the limit too low for what the app genuinely needs, or is the app leaking? Look at actual usage against the limit:

kubectl top pod <name> --containers        # live memory usage per container
kubectl get pod <name> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

If the app steadily needs, say, 300Mi to do its work but the limit is 256Mi, the limit is simply too low for the real working set - raise it. But if usage climbs continuously until it hits the limit, then resets on restart, and climbs again, that is a memory leak and raising the limit only delays the next kill (and makes each kill more disruptive). The correct move there is to fix the leak, not to keep bumping the ceiling.
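The leak-versus-low-limit call is easier with a few samples in hand than with a single kubectl top reading. The sketch below is a crude heuristic, not a leak detector: memory_trend is a hypothetical helper, the input is assumed to be plain Mi numbers (oldest first, one per line, unit suffix already stripped), and the 1.5x growth and 90% thresholds are arbitrary assumptions.

```shell
# Sketch: classify a series of memory samples against a limit.
# $1 = memory limit in Mi; stdin = one usage sample in Mi per line, oldest first.
# memory_trend and its thresholds (1.5x growth, 90% of limit) are assumptions.
memory_trend() {
  awk -v limit="$1" '
    { v = $1 + 0 }                        # force numeric comparison
    NR == 1 { first = v; max = v }
    NR > 1 && v < prev { dips++ }         # usage went down at least once
    { if (v > max) max = v; prev = v; last = v }
    END {
      if (NR < 3) {
        print "too few samples"
      } else if (dips == 0 && last >= 1.5 * first) {
        print "climbing: suspect a leak"
      } else if (max >= 0.9 * limit) {
        print "steady near limit: limit likely too low"
      } else {
        print "steady with headroom"
      }
    }'
}
```

Feed it readings taken a minute or more apart from kubectl top pod <name> --containers. A series that never dips and keeps growing points at the app; a flat series pressed against the limit points at the limit.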

The tempting wrong answer under pressure is "just give it more memory" or "add more replicas" - which spreads the same leaking process across more pods that all OOM in turn, burning cluster capacity while fixing nothing. Deciding between scaling and fixing under a real OOM incident is the whole point of the OOMKilled: scale vs fix challenge. Also remember the memory/CPU asymmetry from fundamentals: over the memory limit gets you killed; over the CPU limit only throttles you. If a pod is slow but not dying, you are looking at CPU throttling, not OOM - check kubectl top pod for CPU pinned at the limit.

Running but not Ready: the readiness gate

The pod shows Running but 0/1 in the READY column. The container is up and has not crashed - but its readiness probe is failing, so the pod is pulled out of its Service's load-balancing rotation and gets zero traffic. This is not a crash; it is Kubernetes deciding the pod is not fit to serve.

kubectl get pods
# NAME   READY   STATUS    RESTARTS   AGE
# web    0/1     Running   0          2m

kubectl describe pod <name>
# Events:
#   Warning  Unhealthy  Readiness probe failed: HTTP probe returned 503

The Events tell you the probe is failing and how (bad status code, connection refused, timeout). Then the question is why the readiness endpoint says "not ready":

The app is still warming up - slow init, cache priming, JIT. If it eventually flips to Ready, the probe settings are probably fine and you were watching a normal warmup. If it never flips, keep reading.
A dependency is down - a well-designed readiness endpoint returns unhealthy when a required downstream (database, cache, upstream API) is unreachable, deliberately shedding traffic until it recovers. So a stuck 0/1 often means "the app is fine but its dependency is not." Check what the readiness endpoint actually verifies.
The probe itself is misconfigured - wrong path, wrong port, or timeoutSeconds too short for a slow endpoint. Confirm by hitting the endpoint yourself from inside (see exec below) and comparing to the probe spec.

Because the pod is Running, you can get in and test the readiness endpoint directly - which no amount of staring at YAML will do:

kubectl exec -it <name> -- curl -s -o /dev/null -w '%{http_code}\n' localhost:8080/ready
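When you read the status code from that curl, judge it the way the kubelet does: an HTTP probe passes on any status from 200 up to but not including 400. The helper below is a sketch with a hypothetical name, readiness_verdict; the 000 case relies on curl printing 000 for %{http_code} when it gets no HTTP response at all.

```shell
# Sketch: interpret a manually fetched status code like an HTTP probe would.
# readiness_verdict is a hypothetical helper; 000 is what curl's %{http_code}
# prints when no HTTP response was received.
readiness_verdict() (
  code=$1
  if [ "$code" -ge 200 ] 2>/dev/null && [ "$code" -lt 400 ]; then
    echo "pass ($code): endpoint is healthy, so compare path, port, and timeout in the probe spec"
  elif [ "$code" = "000" ]; then
    echo "no response: wrong port, app not listening, or a timeout"
  else
    echo "fail ($code): the app itself reports not ready, so check its dependencies"
  fi
)
```

A pass here while the pod stays 0/1 means the probe and your manual request are not hitting the same thing, which points at the probe configuration rather than the app.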

Terminating (stuck): the pod will not die

You deleted a pod (or a Deployment rollout is replacing it) and it sits in Terminating forever. Termination in Kubernetes is: send SIGTERM, wait up to terminationGracePeriodSeconds (default 30s), then SIGKILL - and only after all finalizers clear does the object leave etcd. A stuck Terminating means one of those stages is jammed.

kubectl get pod <name> -o jsonpath='{.metadata.finalizers}'   # any finalizers still set?
kubectl get pod <name> -o yaml | grep -A3 deletionTimestamp   # confirm it is being deleted

The three real causes:

Finalizers not clearing - a finalizer is a hook that blocks deletion until some external cleanup (detach a volume, deregister from a load balancer, tear down a custom resource) completes and removes the finalizer. If the controller responsible is down or wedged, the pod hangs indefinitely. Diagnose by reading the finalizer names; fix by making the responsible controller do its job. Force-removing finalizers with kubectl patch is a last resort that can orphan the very cloud resources the finalizer was protecting - understand what you are skipping first.
App ignores SIGTERM - the container catches or ignores the termination signal (or PID 1 is a shell that does not forward signals to the real process), so it never shuts down and Kubernetes waits out the full grace period before SIGKILL. Fix the app to trap SIGTERM and exit, or run it as PID 1 properly (exec in the entrypoint, or an init like tini).
Grace period too long - a deliberately huge terminationGracePeriodSeconds makes a slow drain look "stuck" when it is just waiting. Check the spec before assuming a bug.
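For the SIGTERM case, this is roughly what "trap SIGTERM and exit" looks like when the process is a shell script. It is a minimal sketch: graceful_worker is a hypothetical stand-in for a real workload, and the sleep loop stands in for real work.

```shell
# Sketch: a long-running shell worker that honors SIGTERM.
# graceful_worker is a hypothetical stand-in for a real entrypoint script.
graceful_worker() {
  # On SIGTERM, finish up and exit 0 instead of waiting to be SIGKILLed.
  trap 'echo "SIGTERM received, shutting down"; exit 0' TERM
  echo "worker started"
  while :; do
    sleep 1   # stand-in for real work; the trap runs once this returns
  done
}
```

If the script only launches a binary, the usual alternative is to end the entrypoint with exec your-binary "$@" so the binary replaces the shell as PID 1 and receives SIGTERM directly.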

If you genuinely must force it and understand the consequences:

kubectl delete pod <name> --grace-period=0 --force

That skips the graceful shutdown and removes the object from the API even if the container is still running underneath - use it knowingly, not reflexively.

Tools beyond the basic three

When STATUS -> describe -> logs is not enough, this is the next tier.

Events, cluster-wide and sorted. describe shows one object's events, but events are objects in their own right and Kubernetes keeps them for only about an hour by default. To see everything that is happening, in time order:

kubectl get events --sort-by=.metadata.creationTimestamp
kubectl get events --field-selector involvedObject.name=<pod>   # just this pod's events
kubectl get events -A --sort-by=.lastTimestamp | tail -30       # cluster-wide, most recent

This catches things a single describe misses - a node going NotReady, evictions, quota rejections, a storm of failures across many pods pointing at one shared cause.
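To spot that kind of storm quickly, count Warning events by reason instead of reading them one by one. A sketch, assuming a hypothetical helper name, warning_reason_counts; it uses the type=Warning field selector and a custom-columns output to print only the reason.

```shell
# Sketch: tally Warning events across all namespaces by reason, most frequent first.
# warning_reason_counts is a hypothetical helper name.
warning_reason_counts() {
  kubectl get events -A --field-selector type=Warning \
    -o custom-columns=REASON:.reason --no-headers |
    sort | uniq -c | sort -rn
}
```

One reason dominating the list (FailedScheduling, BackOff, FailedMount) usually means one shared cause, not many separate bugs.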

A shell inside the container. When the app is running but misbehaving, get in and look:

kubectl exec -it <name> -- /bin/sh      # or /bin/bash
# now: check env vars, config files, hit localhost endpoints, run the app's own CLI

Ephemeral debug containers, for distroless images. Modern images are often distroless or scratch - no shell, no curl, no ps, so exec fails with "executable file not found." kubectl debug attaches an ephemeral container into the running pod, sharing its network and (optionally) process namespace, so you bring your own tools without rebuilding the image:

kubectl debug -it <name> --image=nicolaka/netshoot --target=<container>
# now you have curl, dig, nslookup, tcpdump, ss - against the pod's own network namespace

--target shares the process namespace so you can see the app's processes too. This is the single biggest upgrade to Kubernetes debugging in recent versions - it means "the image has no shell" stops being a dead end.

Copy a broken pod to poke at it. kubectl debug <pod> --copy-to=debug-web --set-image=*=busybox --share-processes clones the pod so you can experiment without touching the live one - useful when you cannot afford to disturb the failing pod but need to change its command.

port-forward, to isolate a layer. If a Service is not responding, you do not know if the app is broken or the Service/network is. Cut the network out of the picture by tunneling straight to a pod:

kubectl port-forward pod/<name> 8080:8080     # laptop:8080 -> this pod's :8080
# then curl localhost:8080 - if it works, the app is fine and the problem is Service/DNS/Ingress
kubectl port-forward svc/web 8080:80          # via the Service, to test the Service layer

top, for resource pressure. kubectl top pods and kubectl top nodes (requires metrics-server) show live CPU and memory. This is how you catch a node under memory pressure evicting pods, a CPU-throttled pod that is slow but not dying, or the leaking container in an OOM investigation.

kubectl top pods -A --sort-by=memory
kubectl top nodes

Networking and DNS: "the service returns nothing"

A whole class of incidents has healthy pods and a Service that returns nothing. The fundamentals guide flagged the cause: a Service routes by label selector, so if its selector matches no pods, it has no backends and every request to it fails. The confirming command is get endpoints:

kubectl get endpoints <service>
# NAME   ENDPOINTS                     AGE
# web    10.1.2.3:8080,10.1.2.4:8080   5m     # good: real pod IPs behind it
# web    <none>                        5m     # BAD: empty - selector matches nothing

Empty endpoints (<none>) is the smoking gun. Every pod can be Running and Ready and the Service still serves nothing because it is wired to zero of them. The fix is to reconcile the label mismatch - compare the Service selector to the pods' labels:

kubectl get service <service> -o jsonpath='{.spec.selector}'   # what the Service looks for
kubectl get pods --show-labels                                  # what the pods actually have
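The comparison itself is simple: a pod is selected only if every key=value pair in the selector appears among its labels; extra labels on the pod do not matter. The sketch below makes that rule explicit. selector_matches is a hypothetical helper, and both arguments are assumed to be comma-separated key=value lists, which is how --show-labels prints labels.

```shell
# Sketch: does a pod's label set satisfy a Service selector?
# selector_matches is a hypothetical helper; both arguments are
# comma-separated key=value lists, e.g. "app=web,tier=frontend".
selector_matches() (
  selector=$1
  labels=$2
  set -f        # no filename expansion while splitting
  IFS=,         # split the selector on commas (contained by the subshell)
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                       # this pair is present on the pod
      *) echo "missing: $pair"; exit 1 ;;   # first pair the pod lacks
    esac
  done
  echo "match"
)
```

For example, selector_matches "app=web" "app=webapp,tier=frontend" reports missing: app=web, the kind of one-character mismatch that leaves endpoints empty.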

A subtler variant: the Endpoints object lists only ready pods, so if pods are 0/1 (readiness failing, from earlier) they are correctly excluded and the Service has fewer or no backends. Endpoints and readiness are linked.

For DNS and connectivity, test from inside a pod, because that is where your app lives - name resolution and network reachability from your laptop tell you nothing about the cluster's internal DNS. Use a pod that has tools, or a netshoot debug container:

# from inside a pod (exec in, or kubectl debug with netshoot):
nslookup web                              # resolve a Service in the same namespace
nslookup web.prod.svc.cluster.local       # fully-qualified cross-namespace
nslookup kubernetes.default               # sanity-check DNS itself works at all
wget -qO- http://web:80/                  # can we actually reach the Service?

If nslookup web fails but nslookup web.prod.svc.cluster.local works, it is a namespace/search-domain issue. If DNS resolves but the connection hangs, suspect a NetworkPolicy blocking the traffic, a wrong targetPort, or the endpoints being empty (back to get endpoints). If even kubernetes.default fails to resolve, cluster DNS itself is broken - check the CoreDNS pods in kube-system (kubectl get pods -n kube-system -l k8s-app=kube-dns and their logs). Intermittent DNS failures under load - some requests resolve, some do not - are their own nasty pattern, worked through end-to-end in the intermittent Service DNS challenge.
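That decision tree can be run as one function from inside a pod or a netshoot debug container. This is a sketch under assumptions: dns_triage is a hypothetical helper, the cluster domain is assumed to be cluster.local, and it relies on nslookup exiting non-zero when a name does not resolve. The third branch (DNS works, name does not exist) is my own addition to the cases in the text.

```shell
# Sketch: walk the DNS checks in order and name the likely problem.
# Run inside a pod. dns_triage is a hypothetical helper; cluster.local and
# nslookup's non-zero exit on failure are assumptions.
dns_triage() (
  svc=$1
  ns=$2
  if nslookup "$svc" >/dev/null 2>&1; then
    echo "short name resolves: DNS is fine, test connectivity next"
  elif nslookup "$svc.$ns.svc.cluster.local" >/dev/null 2>&1; then
    echo "only the FQDN resolves: namespace or search-domain issue"
  elif nslookup kubernetes.default >/dev/null 2>&1; then
    echo "cluster DNS works but $svc.$ns was not found: check the Service name and namespace"
  else
    echo "cluster DNS is broken: check the CoreDNS pods in kube-system"
  fi
)
```

Call it as dns_triage web prod. It only covers name resolution; a name that resolves but hangs on connect is still the NetworkPolicy, targetPort, or empty-endpoints path.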

The incident runbook

Tying it together, this is the order I actually work a Kubernetes incident:

Read the STATUS. kubectl get pods -o wide (add -A if you are not sure of the namespace). The STATUS and RESTARTS columns already narrow it to a handful of cases.
Describe the pod and read the Events. kubectl describe pod <name> - the Events at the bottom and the Last State / exit code tell you why in the scheduler's and kubelet's own words. Do not skip this to jump to logs.
Read the logs, including --previous. kubectl logs <name> --previous for anything that crashed - it is the dying container's testimony and it disappears the moment you delete the pod.
Match the state to its root cause: Pending = scheduling (resources/taints/affinity); ImagePull* = image name/tag/pull secret; CrashLoopBackOff = app boot failure or an over-eager liveness probe (check --previous and the exit code); OOMKilled = memory limit vs real usage; Running-but-not-Ready = readiness probe or a down dependency; Terminating-stuck = finalizers or SIGTERM not honored.
If the network is in question, kubectl get endpoints <service> first (empty = selector mismatch), then test DNS and connectivity from inside a pod with a netshoot debug container.
Reach for the deeper tools when needed: get events --sort-by for the timeline, exec -it or kubectl debug for a shell (including into distroless images), port-forward to isolate app vs network, top for resource pressure.
Fix the desired state or the bug - not the pod. Change the Deployment spec (image, limits, probes, requests) or the application code, then let the control loop reconcile. Deleting the pod without a diagnosis just recreates the same failure and burns your evidence.

That last point is the whole philosophy of debugging a declarative system. Kubernetes is always trying to make reality match what you asked for, so if reality is broken, the answer is almost never "churn reality" - it is "figure out what you asked for that was wrong, and ask for the right thing." Read before you change. If you want to practice arguing these calls under pushback from a senior engineer, the Kubernetes challenges put you on the spot with real crash, OOM, and DNS incidents.
