Kubernetes Storage: Volumes, PVCs, and StatefulSets

Why pod data dies with the pod, and the abstraction that outlives it: emptyDir, the PV/PVC split, StorageClasses and dynamic provisioning, access modes, reclaim policies, and StatefulSets for stateful apps.

Everything you learned about pods being disposable comes back to bite you the moment your app needs to remember something. A pod's filesystem is born and dies with the pod: crash it, reschedule it, roll it out, and whatever it wrote is gone. That is fine for a stateless web server and a disaster for a database. Kubernetes solves this the way it solves everything - with a layer of abstraction between the thing that wants storage and the thing that supplies it. This guide walks that layer from the bottom up: why ephemeral storage is ephemeral, what PersistentVolumes and PersistentVolumeClaims actually decouple, how a claim turns into a real cloud disk on its own, the access-mode and reclaim-policy traps that eat data, and finally StatefulSets - the controller you reach for when a Deployment is the wrong tool. If the control loop from the fundamentals still feels shaky, read that first; storage is just desired state for disks.

Ephemeral storage: why data dies with the pod

By default, everything a container writes goes to its container filesystem - a writable layer on top of the image. That layer exists only as long as the container exists. Restart the container (a crash, a liveness probe kill) and you get a fresh copy of the image with an empty writable layer. Reschedule the pod to another node and it is a brand-new pod with brand-new storage. Nothing you wrote survives. This is not a bug; it is the whole point of the disposable-pod model. Treat the container filesystem as scratch space that can vanish at any moment.

The next step up is emptyDir, a volume that lives for the lifetime of the pod rather than the container. It is created empty when the pod is scheduled to a node and deleted when the pod is removed from that node. Its usefulness is narrow but real: it survives a container restart within the same pod, and it is shared between all containers in the pod - which is exactly what a sidecar pattern needs (one container writes, another reads). It is also the standard scratch space for caches, temp files, and workspace between init and main containers.

apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
    - name: app
      image: myapp:1.4.0
      volumeMounts:
        - name: cache
          mountPath: /tmp/cache
  volumes:
    - name: cache
      emptyDir: {}          # gone the moment the pod is removed
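The sidecar case mentioned above can be sketched the same way. This is a minimal illustration, not a production pattern: the pod name, the busybox image tag, and the log path are assumptions chosen for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo              # hypothetical name
spec:
  containers:
    - name: writer
      image: busybox:1.36
      # appends a timestamp to the shared file every 5 seconds
      command: ["sh", "-c", "while true; do date >> /shared/out.log; sleep 5; done"]
      volumeMounts:
        - name: shared
          mountPath: /shared
    - name: reader
      image: busybox:1.36
      # tail -F keeps retrying until the writer has created the file
      command: ["sh", "-c", "tail -F /shared/out.log"]
      volumeMounts:
        - name: shared
          mountPath: /shared
          readOnly: true          # the sidecar only needs to read
  volumes:
    - name: shared
      emptyDir: {}                # one volume, visible to both containers
```

Both containers see the same directory because the volume belongs to the pod, not to either container. Kill the writer and its restart picks up the same file; delete the pod and the file is gone.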

Be clear about what "lifetime of the pod" means: it is the pod object, not a logical instance of your app. When a Deployment rolls out a new version, it does not move your pod - it deletes the old pod and creates a new one. The new pod gets a new emptyDir. So emptyDir survives a liveness-probe restart but not a rollout, a rescheduling, or a node failure. There is one useful trick: emptyDir: { medium: Memory } backs the volume with tmpfs (RAM), which is fast, but every byte written to it counts against the memory limit of the container that wrote it, and it still dies with the pod. None of this is persistence. For data that must outlive the pod, you need storage that is not tied to the pod's lifecycle at all.
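A memory-backed emptyDir might look like the sketch below. The pod name, mount path, and sizes are illustrative assumptions; the point is that a sizeLimit on the volume and a memory limit on the container belong together, since tmpfs usage is charged as memory.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tmpfs-demo                # hypothetical name
spec:
  containers:
    - name: app
      image: myapp:1.4.0
      resources:
        limits:
          memory: 512Mi           # files written to the tmpfs volume count toward this
      volumeMounts:
        - name: fast-scratch
          mountPath: /tmp/scratch
  volumes:
    - name: fast-scratch
      emptyDir:
        medium: Memory            # tmpfs instead of node disk
        sizeLimit: 128Mi          # cap the volume well below the memory limit
```

Leaving sizeLimit off is legal, but then a runaway cache can eat the container's whole memory budget before anything stops it.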

The persistence abstraction: PV vs PVC

Kubernetes splits durable storage into two objects, and the split is the whole idea. A PersistentVolume (PV) is a piece of real storage in the cluster - an EBS volume, a GCE persistent disk, an NFS export, a Ceph RBD image. It has a capacity, an access mode, and a lifecycle independent of any pod. A PersistentVolumeClaim (PVC) is a request for storage: "I need 20Gi, ReadWriteOnce." A pod references a PVC, never a PV directly. Kubernetes binds the claim to a PV that satisfies it, and the pod mounts whatever got bound.

Why bother with two objects instead of one? Because it decouples the app author from the storage admin, the same way a Service decouples the client from the pod. The person writing the Deployment says "I need 20Gi of fast storage" without knowing or caring whether that is EBS, a SAN LUN, or a directory on a NAS. The person (or the cloud, as we will see) who provisions storage does not need to know what app will use it. The PVC is the interface between them. This is the recurring Kubernetes shape: a claim on one side, a supply on the other, and the control plane matching them.
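To make the supply side concrete, here is a sketch of a hand-written PV backed by an NFS export. The PV name, server address, and export path are made up for illustration; treat it as the shape of the object rather than something to apply as-is.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-data-01               # hypothetical name
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""            # no class: only claims that also ask for "" can bind
  nfs:
    server: nfs.example.internal  # hypothetical NFS server
    path: /exports/data           # hypothetical export
```

Nothing in this object mentions an app, and nothing in a claim mentions NFS. One caveat when matching by hand: on a cluster with a default StorageClass, a claim that omits storageClassName is assigned the default class and will not bind to a class-less PV like this one, so a claim meant for it should set storageClassName: "" explicitly.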

# The claim: what the app asks for
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi

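---
# The pod consumes the claim, not the volume
apiVersion: v1
kind: Pod
metadata:
  name: db                          # hypothetical name, for illustration
spec:
  containers:
    - name: db
      image: postgres:16
      env:
        - name: POSTGRES_PASSWORD   # the image will not initialize without one
          valueFrom:
            secretKeyRef:
              name: db-credentials  # hypothetical Secret, assumed to exist
              key: password
        - name: PGDATA              # subdirectory, so initdb ignores lost+found on a fresh disk
          value: /var/lib/postgresql/data/pgdata
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data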

Now the important part: that PVC outlives the pod. Delete the pod, and the PVC (and the PV it is bound to, and the data on it) stays. Create a new pod that references the same claim and it mounts the same data. That is persistence - storage whose lifecycle is deliberately unhooked from the disposable pod. You can see the binding at a glance:

kubectl get pvc              # STATUS Bound means it found a PV
kubectl get pv               # the actual volumes and their CLAIM
kubectl describe pvc data    # if STATUS is Pending, the events say why

A PVC stuck in Pending is the classic first stumble: no PV satisfies the claim, and nothing is creating one. Which is exactly the problem StorageClasses solve.

StorageClasses and dynamic provisioning

In the early days you pre-created PVs by hand and hoped a claim matched one. Outside of special cases such as local disks or adopting an existing volume, hardly anyone does that anymore. A StorageClass describes a kind of storage and, crucially, names a provisioner that can create the underlying disk on demand. When a PVC asks for a StorageClass, the provisioner reaches out to the cloud (or storage system), creates a real volume that fits the request, wraps it in a PV, and binds it to the claim. This is dynamic provisioning: you write a PVC, and a disk appears. No admin, no pre-provisioning.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com     # the CSI driver that talks to AWS
parameters:
  type: gp3
  iops: "3000"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
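A claim against that class could look like the following sketch. The claim name is an assumption; the only new field compared with the earlier claim is storageClassName.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-fast                 # hypothetical name
spec:
  storageClassName: fast-ssd      # ask for the class defined above by name
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 20Gi
```

With this in place there is no PV to write: the provisioner named in the class creates the disk and the PV object for you.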

On a managed cluster you rarely write this yourself - EKS, GKE, and AKS typically ship at least one StorageClass already wired to their block-storage driver (a gp2 class on EKS, a pd-balanced class on GKE, a managed-disk class on AKS; the exact names and which one is marked default vary by provider and version, so check rather than assume). A PVC that names no storageClassName gets whichever class is marked default. That is why a bare PVC "just works" on most cloud clusters and sits in Pending forever on a cluster with no default class, bare metal being the usual case - there is nothing to provision it.

kubectl get storageclass                 # the (default) one is marked
kubectl get sc -o wide
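If no class is marked default, one way to fix that is the is-default-class annotation on the StorageClass. The sketch below reuses the fast-ssd class from above purely as an example; on bare metal you would substitute whatever provisioner your cluster actually runs.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
  annotations:
    # claims without a storageClassName will be assigned this class
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com      # swap for your cluster's provisioner
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Keep exactly one class marked default. With more than one, which class a bare claim ends up with depends on the Kubernetes version, which is not something you want to discover by accident.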

Two fields on that StorageClass are worth internalizing. volumeBindingMode: WaitForFirstConsumer delays creating the disk until a pod that needs it is actually scheduled, so the volume gets created in the same zone as the pod - without it (the default mode is Immediate), on a multi-zone cluster, you can provision an EBS volume in us-east-1a and then have the pod scheduled to us-east-1c, where the volume cannot attach (block storage is zonal). The visible side effect is that the PVC sits in Pending until a pod that uses it is scheduled, which is expected rather than a fault. allowVolumeExpansion: true lets you grow a PVC later by editing its resources.requests.storage; you cannot shrink it. Get these right up front, because both are painful to discover in production.
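Growing a claim is just an edit to the object you already have. As a sketch, this is the earlier data claim with only the requested size changed; it assumes the class the claim was bound through has allowVolumeExpansion: true.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 40Gi               # was 20Gi; growing is allowed, shrinking is rejected
```

Whether the filesystem is resized while the pod is running or only after a restart depends on the storage driver, so check its documentation before relying on an online resize.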

Access modes and reclaim policies: where data gets lost

Two settings on PVs and PVCs cause more confusion and more data loss than anything else in this space. Learn them cold.

Access modes: RWO does not mean what you think

An access mode declares how a volume can be mounted:

ReadWriteOnce (RWO) - mountable read-write by a single node.
ReadOnlyMany (ROX) - mountable read-only by many nodes.
ReadWriteMany (RWX) - mountable read-write by many nodes.

The trap is in RWO. Almost everyone reads "Once" as "one pod." It means one node. A single EBS volume or GCE disk is a block device that can only attach to one machine at a time - that is a hardware fact, not a Kubernetes choice. So an RWO volume can be used by multiple pods only if they all land on the same node, and in practice it means one pod. The instant you try to run two replicas of a Deployment that both mount the same RWO PVC and they schedule to different nodes, the second pod hangs in ContainerCreating with a "Multi-Attach error" because the volume is already attached elsewhere.
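If "one pod" is really what you want, there is a separate access mode for it: ReadWriteOncePod, which became stable in Kubernetes 1.29 and works only with CSI-backed volumes. A minimal sketch, with a hypothetical claim name and the fast-ssd class assumed from earlier:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer             # hypothetical name
spec:
  accessModes: ["ReadWriteOncePod"]   # one pod in the whole cluster, not one node
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 20Gi
```

With this mode a second pod that references the claim is not scheduled at all, instead of starting on the same node and quietly sharing the disk.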

This is why you cannot just bump a stateful Deployment's replicas to 3 and expect it to work. Standard cloud block storage is RWO, full stop. If you genuinely need many pods writing to the same volume across nodes, you need RWX, and that requires a shared filesystem behind it - NFS, AWS EFS, Azure Files, CephFS. Those exist and work, but they are a different (usually slower, more expensive) class of storage, and reaching for RWX is often a sign you should be using object storage or a database instead of a shared mounted disk. The right pattern for "each replica needs its own persistent disk" is not RWX at all - it is a StatefulSet, below.
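For comparison, an RWX claim looks almost identical on paper. The class name below is a placeholder for whatever shared-filesystem class your cluster offers (an NFS, EFS, Azure Files, or CephFS provisioner); it is not a class that exists by default.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets             # hypothetical name
spec:
  accessModes: ["ReadWriteMany"]  # many nodes, read-write
  storageClassName: shared-fs     # hypothetical class backed by a shared filesystem
  resources:
    requests:
      storage: 50Gi
```

The manifest is the easy part. Requesting ReadWriteMany from a class backed by plain block storage will generally not give you a working shared volume, so the access mode has to match what the backend can actually do.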

Reclaim policies: how Delete has eaten people's data

A PV's reclaim policy decides what happens to the underlying disk when its PVC is deleted:

Delete - the disk itself is deleted along with the PV. Gone.
Retain - the PV and the real disk stick around after the PVC is deleted; you clean it up manually.

Dynamic provisioning defaults to Delete, because for ephemeral test workloads you do not want orphaned disks piling up a cloud bill. But "Delete" means exactly that: kubectl delete pvc data on a Delete-policy volume asynchronously destroys the actual EBS volume and everything on it. There is no trash can, no soft delete, no undo. People have wiped production databases this way - a cleanup script that deleted a namespace, a Helm uninstall, a kubectl delete -f against the wrong file, and the reclaim policy quietly took the data with the claim.

For anything you cannot afford to lose, set the StorageClass or the PV to Retain. Then deleting the PVC releases the PV (it goes to Released status) but leaves the disk and its data intact, so a mistake is recoverable: a Released PV will not bind to a new claim on its own, but an admin can clear its claimRef or create a fresh PV that points at the same disk. You pay for the leftover disk and clean it up deliberately, which is the correct tradeoff for stateful data. You can also patch an existing PV:

kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

Make Retain the default for any StorageClass that backs real state, and treat Delete as the setting for scratch and CI only. This one line has saved more production data than most backup strategies.
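A Retain-by-default class is a small variation on the earlier one. The class name here is an assumption; the parameters mirror the fast-ssd example.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain           # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain             # deleting the PVC leaves the PV and the disk in place
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

The class's reclaimPolicy is copied onto each PV when it is provisioned, so volumes that already exist keep whatever policy they were created with. Those need the kubectl patch shown above.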

StatefulSets: when Deployments are wrong

A Deployment assumes its pods are interchangeable - identical, nameless, replaceable in any order. That assumption is false for databases, message brokers, and clustered stores, where each replica has an identity, holds its own slice of data, and must come up and go down in a specific order. Forcing that onto a Deployment breaks in exactly the ways the previous sections predict: pods fight over one RWO volume, replicas have no stable names to form a cluster, and a rolling update tears down members in random order. The StatefulSet is the controller built for this, and it gives you four things a Deployment cannot.

Stable network identity. StatefulSet pods are named by ordinal, not by a random hash: db-0, db-1, db-2. That name is stable across rescheduling - if db-1 dies, its replacement is still db-1, with the same DNS name. This is what lets cluster members find each other: a Postgres replica can be told "your primary is db-0" and that stays true.

Ordered, graceful lifecycle. Pods are created in order (db-0 fully Ready before db-1 starts) and deleted in reverse (db-2 first). For a quorum-based system - etcd, ZooKeeper, a database with a bootstrap node - this ordering is not a nicety, it is correctness.

Per-pod persistent storage via volumeClaimTemplates. This is the key mechanism. Instead of one shared PVC, a StatefulSet declares one or more volumeClaimTemplates, and the controller stamps out a separate PVC for each pod, named after the template and the pod: data-db-0, data-db-1, data-db-2. Each pod gets its own RWO volume with its own data, and - critically - that PVC is sticky: reschedule db-1 to another node and it reattaches to data-db-1, the same disk with the same data. This is how you run N stateful replicas on RWO cloud block storage without the multi-attach problem, because no volume is ever shared.

A required headless Service. A StatefulSet needs a headless Service (clusterIP: None) to give each pod its stable DNS record: db-0.db.default.svc.cluster.local. A normal Service load-balances across a virtual IP, which is the opposite of what you want here - you need to address specific members. The headless Service returns per-pod DNS instead of a single VIP.

apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # headless: per-pod DNS, no load-balancing VIP
  selector:
    app: db
  ports:
    - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db          # must point at the headless Service above
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
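      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD    # the image will not initialize without one
              valueFrom:
                secretKeyRef:
                  name: db-credentials   # hypothetical Secret, assumed to exist
                  key: password
            - name: PGDATA               # subdirectory, so initdb ignores lost+found on a fresh disk
              value: /var/lib/postgresql/data/pgdata
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data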
  volumeClaimTemplates:    # one PVC per pod, not shared
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 20Gi
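To see the per-pod DNS record do its job, a throwaway pod can address one specific member by name. This sketch assumes the StatefulSet lives in the default namespace and the cluster uses the usual cluster.local domain; the pod name is made up.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-check                  # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: postgres:16          # reused only because it ships pg_isready
      # targets db-0 specifically, not "any replica behind a VIP"
      command: ["pg_isready", "-h", "db-0.db.default.svc.cluster.local", "-p", "5432"]
```

Swap db-0 for db-1 or db-2 and the check goes to that member instead. A normal ClusterIP Service cannot give you that guarantee.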

Two sharp edges worth knowing before you ship one. First, those per-pod PVCs outlive the StatefulSet by design - scale down from 3 to 1 and the PVCs for db-1 and db-2 are not deleted, so their data is preserved if you scale back up. Deleting the StatefulSet leaves them too. That is the safe default, but it means scaling down does not reclaim disk, and you clean those PVCs up by hand. Second, StatefulSets do the mechanical parts (identity, ordering, storage) but they know nothing about your app's semantics - they will not run a leader election, promote a replica, or restore from backup. That logic lives in your init scripts or, in the real world, in an Operator.
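On recent Kubernetes versions the keep-everything behavior is configurable through persistentVolumeClaimRetentionPolicy on the StatefulSet. The fragment below shows only the relevant field and is a sketch: check that your cluster version supports it before relying on it.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Delete            # scaling down removes the PVCs of the removed pods
    whenDeleted: Retain           # deleting the StatefulSet keeps all PVCs (the default)
  # serviceName, replicas, selector, template, and volumeClaimTemplates as in the manifest above
```

Both fields default to Retain, which matches the behavior described here. Setting either to Delete on top of a Delete reclaim policy means a scale-down or a delete really does destroy the disks, so treat it as an opt-in for disposable environments.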

The honest take: should you run databases in Kubernetes at all?

Everything above makes it possible to run a stateful database in Kubernetes. Whether you should is a separate question, and the honest answer for most teams is: probably not the ones that matter, not without a very good reason.

The case against is about blast radius. A managed database (RDS, Cloud SQL, managed MongoDB) hands the hardest problems to someone whose full-time job is getting them right: backups that are actually tested, point-in-time recovery, failover, patching, replication, storage that grows without an outage. Running the same database yourself on Kubernetes means you own all of that. And the surface area is unforgiving - you have already seen two ways to hurt yourself in this guide alone (a Delete reclaim policy that destroys the data, and a multi-attach mistake that leaves a replica unable to start), and there are more. A stateless app that Kubernetes reschedules badly costs you a few retried requests; a database that Kubernetes reschedules badly, or a StatefulSet whose backups were never verified, is a business-ending incident. Same platform, wildly different cost of getting it wrong.

The case for has narrowed but is real. If you run a serious stateful workload on Kubernetes, do it through a mature Operator (CloudNativePG, the Zalando Postgres operator, Strimzi for Kafka, the Vitess operator) - a controller that encodes the operational knowledge StatefulSets lack: it handles failover, backups, restores, and version upgrades as first-class operations. Do not hand-roll a StatefulSet for a production database and call it done; the StatefulSet gives you identity and storage, the Operator gives you operations. The legitimate reasons to run stateful in-cluster are wanting one control plane and one set of tooling, portability across clouds or on-prem where a managed service does not exist, or cost and latency at a scale where managed pricing stops making sense. Those are good reasons. "It was easy to kubectl apply" is not.
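For a sense of what "through an Operator" looks like, here is a minimal sketch of a CloudNativePG cluster. It assumes the operator and its CRDs are already installed, reuses the fast-ssd class from earlier, and uses field names as I understand them from the CloudNativePG documentation, so verify them against the version you install.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db                        # hypothetical name
spec:
  instances: 3                    # one primary plus two replicas, managed by the operator
  storage:
    size: 20Gi
    storageClass: fast-ssd        # assumed to exist, as defined earlier
```

The manifest is shorter than the hand-written StatefulSet because the operator owns the parts that are missing: pod and PVC creation, replication setup, and failover. Backups and restores are configured on the same resource rather than bolted on with scripts.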

My default: stateless everything on Kubernetes, databases on the managed service unless you have both a concrete reason not to and the operational maturity (a real Operator, tested restores, someone on call who understands the storage layer) to own the consequences. The same blast-radius reasoning shows up whenever people try to cost-optimize stateful infrastructure - the "spot instances for the database" challenge is the same tradeoff wearing a different hat: the discount is the same as on the stateless tier, the failure cost is not.

The shape of it

Persistence in Kubernetes is one idea working around one constraint: pods are disposable, so durable data has to live in something that is not the pod. The container filesystem and emptyDir are ephemeral by design and die with the container or the pod. Real persistence comes from the PV/PVC split - a claim on one side, a supply on the other, matched by the control plane, which decouples the app author from the storage. StorageClasses close the loop with dynamic provisioning, turning a PVC into a real cloud disk automatically, with WaitForFirstConsumer keeping it in the right zone. The two settings that bite are access modes (RWO means one node, not one pod) and reclaim policies (Delete destroys the disk, so use Retain for anything that matters). And when pods stop being interchangeable, the StatefulSet gives each one a stable identity, ordered lifecycle, and its own sticky PVC via a volumeClaimTemplate behind a headless Service. Just remember that the platform making it possible to run a database in-cluster is not the same as it being wise to - reach for a managed service or a real Operator, and keep the blast radius in mind.
