We use Helm to manage all our resources in a k8s cluster. Recently we had an incident where some k8s resources were modified outside of Helm (see the edit below for details on the root cause).
The end result, however, is that some k8s resources in our cluster no longer match what is specified in the Helm chart of the release.
Example:
We have a Helm chart that contains a HorizontalPodAutoscaler. If I run:

```shell
helm get myservice-release
```

I will see something like this:
```yaml
---
# Source: myservice/charts/default-deployment/templates/default_deployment.yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: myservice-autoscaler
  labels:
    app: myservice
spec:
  minReplicas: 2
  maxReplicas: 10
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myservice-deployment
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 85
```
However, if I run:

```shell
kubectl get hpa myservice-autoscaler -o yaml
```

the `spec.minReplicas` and `spec.maxReplicas` do not match the chart:
```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '{REDACTED}'
    autoscaling.alpha.kubernetes.io/current-metrics: '{REDACTED}'
  creationTimestamp: "{REDACTED}"
  labels:
    app: myservice
  name: myservice-autoscaler
  namespace: default
  resourceVersion: "174526833"
  selfLink: /apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers/myservice-autoscaler
  uid: {REDACTED}
spec:
  maxReplicas: 1
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myservice-deployment
  targetCPUUtilizationPercentage: 85
status:
  currentCPUUtilizationPercentage: 9
  currentReplicas: 1
  desiredReplicas: 1
  lastScaleTime: "{REDACTED}"
```
I suspect there is more than this one occurrence of drift among our k8s resources.

- How do I verify which resources have drifted?
- How do I inform Helm of that drift, so the next deployment can take it into account when applying the release diff?
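For the first question, I imagine the check could look something like the sketch below: render what Helm believes is deployed and diff it against the live cluster state. This assumes Helm 3's `helm get manifest` and `kubectl diff` (available since kubectl 1.13); the release and namespace names are just my examples.

```shell
#!/usr/bin/env bash
# For each release, render the manifests Helm has recorded for it and
# server-side diff them against what is actually in the cluster.
# `kubectl diff` exits non-zero when it finds differences, so drifted
# releases are reported on stderr and the loop keeps going.
set -u
namespace=default
for release in $(helm list -n "$namespace" -q); do
  if ! helm get manifest "$release" -n "$namespace" | kubectl diff -f - >/dev/null; then
    echo "drift detected in release: $release" >&2
  fi
done
```

Is something along these lines the right approach, or is there a more idiomatic way?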
EDIT:

For those of you interested: this was caused by two Helm charts managing the same resource (the autoscaler), each setting different values. It happened because two Helm releases that were meant for different namespaces ended up in the same namespace and were updated with `--force`.
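In case it helps anyone debugging a similar collision: if you are on Helm 3, each object it deploys is stamped with ownership metadata, so you can ask the live object which release (and intended namespace) last claimed it. The resource and namespace below are from my example above.

```shell
# Print the Helm 3 ownership annotations on the live HPA; a mismatch
# between release-namespace and the object's actual namespace is the
# kind of cross-release collision that bit us.
kubectl get hpa myservice-autoscaler -n default -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}{" "}{.metadata.annotations.meta\.helm\.sh/release-namespace}{"\n"}'
```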