0 votes

Scenario 1:

I have 3 local persistent volumes provisioned; each PV is mounted on a different node:

  • 10.30.18.10
  • 10.30.18.11
  • 10.30.18.12

When I start my app with 3 replicas using:

kind: StatefulSet
metadata:
  name: my-db
spec:
  replicas: 3
...
...
  volumeClaimTemplates:
  - metadata:
      name: my-local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-local-sc"
      resources:
        requests:
          storage: 10Gi

Then I notice that each pod and its PV end up on the same host:

  • pod1 with IP 10.30.18.10 has claimed the PV that is mounted on 10.30.18.10
  • pod2 with IP 10.30.18.11 has claimed the PV that is mounted on 10.30.18.11
  • pod3 with IP 10.30.18.12 has claimed the PV that is mounted on 10.30.18.12

(What's not happening: pod1 with IP 10.30.18.10 claiming a PV that is mounted on a different node, e.g. 10.30.18.12.)

The only configuration shared between the PV and the PVC is storageClassName, so I didn't explicitly configure this behavior.

Question: So, who is responsible for this magic? Kubernetes scheduler? Kubernetes provisioner?


Scenario 2:

I have 3 local persistent volumes provisioned:

  • pv1 has capacity.storage of 10Gi
  • pv2 has capacity.storage of 100Gi
  • pv3 has capacity.storage of 100Gi

Now I start my app with 1 replica:

kind: StatefulSet
metadata:
  name: my-db
spec:
  replicas: 1
...
...
  volumeClaimTemplates:
  - metadata:
      name: my-local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-local-sc"
      resources:
        requests:
          storage: 10Gi

I want to ensure that this StatefulSet always claims pv1 (10Gi), even if it is on a different node, and never claims pv2 (100Gi) or pv3 (100Gi).

Question:

Does this happen automatically?

How do I ensure the desired behavior? Should I use a separate storageClassName to ensure this?

What is the PersistentVolumeClaim policy? Where can I find more info?


EDIT:

YAML used for the StorageClass:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: my-local-sc
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
I suggest reading the documentation: kubernetes.io/docs/concepts/storage/persistent-volumes – whites11
I have definitely read everything that I could find, and still have these puzzles. – Sida Zhou
Just to be sure: the my-local-sc storage class is a manually created WaitForFirstConsumer class related to the local PVs? – AndD
@AndD Yes, correct. The StatefulSet, PVs and PVCs are all working. I'm just unsure which PV a PVC will bind to, and how this is determined. – Sida Zhou
I've answered, trying to explain and also quoting the official docs. If something is not clear, let me know. – AndD

1 Answer

3 votes

With local Persistent Volumes, this is the expected behaviour. Let me try to explain what happens when using local storage.

The usual setup for local storage on a cluster is the following:

  • A local storage class, configured to be WaitForFirstConsumer
  • A series of local persistent volumes, linked to the local storage class

And this is all well documented with examples in the official documentation: https://kubernetes.io/docs/concepts/storage/volumes/#local

With this in place, PersistentVolumeClaims can request storage from the local storage class, and a StatefulSet can do the same through its volumeClaimTemplate.
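
As a quick check (a sketch reusing the my-local-sc name from your question), you can confirm the binding mode and see that claims wait for a consumer:

kubectl get storageclass my-local-sc   # VOLUMEBINDINGMODE should show WaitForFirstConsumer
kubectl get pvc                        # claims stay Pending until a Pod that uses them is scheduled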


Let me take your StatefulSet with 3 replicas as an example; each replica requests local storage through the volumeClaimTemplate.

  • When the Pods are first created, they request storage of the required storageClass, in your case my-local-sc.

  • Since this storage class is manually created and does not support dynamic provisioning of new PVs (unlike, for example, Ceph or similar), Kubernetes checks whether a PV belonging to that storage class is available to be bound.

  • If a PV is selected, it is bound to the newly created PVC (and from then on, that claim can only be satisfied by that particular PV, since they are now Bound to each other).

  • Since the PV is of type local, it has a required nodeAffinity that selects a specific node.

  • This forces the Pod, now bound to that PV, to be scheduled only on that particular node.

This is why each Pod was scheduled on the same node as its bound PersistentVolume, and it means that the Pod is restricted to running on that node only.
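
You can inspect that whole chain yourself; as a sketch (the claim names follow the <template>-<statefulset>-<ordinal> convention, so my-local-vol-my-db-0 and so on):

kubectl get pvc my-local-vol-my-db-0                          # shows which PV the claim is Bound to
kubectl get pv                                                # PVs with their status and bound claims
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'   # the node the PV is pinned to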

You can test this easily by cordoning or draining one of the nodes and then trying to restart the Pod bound to the PV on that particular node. What you should see is that the Pod will not start, because the PV's nodeAffinity restricts it to a node that is no longer available.
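
A sketch of that test, assuming the nodes are registered under the IPs from your question and the usual StatefulSet pod naming (my-db-0):

kubectl cordon 10.30.18.10     # make the node unschedulable
kubectl delete pod my-db-0     # the StatefulSet controller recreates the Pod...
kubectl get pod my-db-0        # ...but it stays Pending: its PV is pinned to the cordoned node
kubectl uncordon 10.30.18.10   # allow scheduling again and the Pod starts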


Once each Pod of the StatefulSet is bound to a PV, that Pod will be scheduled only on that specific node. Pods will not change the PV they are using unless the PVC is removed (which forces the Pod to request a new PV to bind to).

Since local storage is handled manually, PVs that were bound and whose related PVC has been removed from the cluster enter the Released state and cannot be claimed again; someone has to handle them, for example by deleting them and recreating new ones at the same location (and possibly cleaning the filesystem as well, depending on the situation).
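
For example (a sketch; the PV name and manifest file are placeholders), you can either recreate the PV or clear its old claim reference so it becomes Available again:

# Option 1: delete the Released PV and re-apply its manifest (clean the data on disk first if needed)
kubectl delete pv <pv-name>
kubectl apply -f <pv-manifest>.yaml

# Option 2: drop the stale claimRef so the PV goes back to Available
kubectl patch pv <pv-name> -p '{"spec":{"claimRef": null}}'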

This means that local storage is a good fit only when:

  • HA is not a concern; for example, you don't care if your app is blocked by a single node being down.

  • HA is handled directly by the app itself; for example, a StatefulSet with 3 Pods running a multi-primary database (Galera, ClickHouse, Percona, for example) or Elasticsearch, Kafka, ZooKeeper and the like, which handle HA on their own and can tolerate one of their nodes being down as long as there is quorum.


UPDATE

Regarding Scenario 2 of your question: let's say you have multiple Available PVs and a single Pod that starts and wants to bind to one of them. This is normal behaviour, and the control plane will select one of those PVs on its own (as long as they match the requests in the claim).

There's a specific way to pre-bind a PV and a PVC, so that they will always bind together. This is described in the docs as "reserving a PV": https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reserving-a-persistentvolume

But the problem is that this cannot be applied to volume claim templates, as it requires the claim to be created manually with special properties.
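
For completeness, a sketch of what reserving looks like with a hand-written claim (pv1 and my-local-sc are reused from your question; the claim name is made up), which is exactly what a volumeClaimTemplate cannot do for you:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-reserved-claim        # hypothetical, created manually rather than by the StatefulSet
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: "my-local-sc"
  volumeName: pv1                # pre-binds this claim to that specific PV
  resources:
    requests:
      storage: 10Gi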

The volume claim template, though, has a selector field which can be used to restrict the selection of a PV based on labels. It can be seen in the API spec: https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#persistentvolumeclaimspec-v1-core

When you create a PV, you can label it however you want; for example, you could label it like the following:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-small-pv
  labels:
    size-category: small
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node-1
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-big-pv
  labels:
    size-category: big
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node-2

And then the claim template can select a category of volumes based on the label. Or, if it doesn't care, it can omit the selector and use any of them (provided the capacity is enough for its claim request).
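
As a sketch for your Scenario 2 (using the size-category label and the local-storage class from the example PVs above; only the template name is reused from your question), the volumeClaimTemplate could look like:

  volumeClaimTemplates:
  - metadata:
      name: my-local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      selector:
        matchLabels:
          size-category: small   # only PVs labeled small (the 10Gi one) can be selected
      resources:
        requests:
          storage: 10Gi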

This can be useful, but it's not the only way to restrict which PVs can be selected, because when the PV is first bound, if the storage class is WaitForFirstConsumer, the following also applies:

Delaying volume binding ensures that the PersistentVolumeClaim binding decision will also be evaluated with any other node constraints the Pod may have, such as node resource requirements, node selectors, Pod affinity, and Pod anti-affinity.

This means that if the Pod has a node affinity to one of the nodes, it will definitely select a PV on that node (if the local storage class used is WaitForFirstConsumer).
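
For instance, a sketch of such a node affinity on the StatefulSet's Pod template (example-node-1 is the hostname used in the example PVs above):

  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - example-node-1   # Pod must run here, so a PV local to this node gets picked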


Last, let me quote the official documentation for the parts that I think answer your questions:

From https://kubernetes.io/docs/concepts/storage/persistent-volumes/

A user creates, or in the case of dynamic provisioning, has already created, a PersistentVolumeClaim with a specific amount of storage requested and with certain access modes. A control loop in the master watches for new PVCs, finds a matching PV (if possible), and binds them together. If a PV was dynamically provisioned for a new PVC, the loop will always bind that PV to the PVC. Otherwise, the user will always get at least what they asked for, but the volume may be in excess of what was requested. Once bound, PersistentVolumeClaim binds are exclusive, regardless of how they were bound. A PVC to PV binding is a one-to-one mapping, using a ClaimRef which is a bi-directional binding between the PersistentVolume and the PersistentVolumeClaim.

Claims will remain unbound indefinitely if a matching volume does not exist. Claims will be bound as matching volumes become available. For example, a cluster provisioned with many 50Gi PVs would not match a PVC requesting 100Gi. The PVC can be bound when a 100Gi PV is added to the cluster.

From https://kubernetes.io/docs/concepts/storage/volumes/#local

Compared to hostPath volumes, local volumes are used in a durable and portable manner without manually scheduling pods to nodes. The system is aware of the volume's node constraints by looking at the node affinity on the PersistentVolume.

However, local volumes are subject to the availability of the underlying node and are not suitable for all applications. If a node becomes unhealthy, then the local volume becomes inaccessible by the pod. The pod using this volume is unable to run. Applications using local volumes must be able to tolerate this reduced availability, as well as potential data loss, depending on the durability characteristics of the underlying disk.