K8s Elasticsearch with Filebeat stays 'not ready' after rebooting

Question

I'm facing a situation I can't make sense of.

Environment
Two dedicated nodes on Azure running CentOS 8.2 (2 vCPU, 16 GB RAM), not AKS

1 master node, 1 worker node.

kubernetes v1.19.3

helm v2.16.12

Helm charts Elastic (https://github.com/elastic/helm-charts/tree/7.9.3)

The first time, it works fine with the installation below.

## elasticsearch, filebeat
# kubectl apply -f pv.yaml
# helm install -f values.yaml --name elasticsearch elastic/elasticsearch
# helm install --name filebeat --version 7.9.3 elastic/filebeat

curl elasticsearchip:9200 and curl elasticsearchip:9200/_cat/indices return the expected values.

But after rebooting the worker node, the pods stay at READY 0/1 and do not work.

NAME READY STATUS RESTARTS AGE
elasticsearch-master-0 0/1 Running 10 71m
filebeat-filebeat-67qm2 0/1 Running 4 40m

In this situation, removing /mnt/data/nodes and rebooting again makes it work.

I don't see anything unusual for the elasticsearch pod.

#logs
{"type": "server", "timestamp": "2020-10-26T07:49:49,708Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[filebeat-7.9.3-2020.10.26-000001][0]]]).", "cluster.uuid": "sWUAXJG9QaKyZDe0BLqwSw", "node.id": "ztb35hToRf-2Ahr7olympw"  }

#describe
  Normal   SandboxChanged          4m4s (x3 over 4m9s)   kubelet          Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  4m3s                  kubelet          Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
  Normal   Created                 4m1s                  kubelet          Created container configure-sysctl
  Normal   Started                 4m1s                  kubelet          Started container configure-sysctl
  Normal   Pulled                  3m58s                 kubelet          Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
  Normal   Created                 3m58s                 kubelet          Created container elasticsearch
  Normal   Started                 3m57s                 kubelet          Started container elasticsearch
  Warning  Unhealthy               91s (x14 over 3m42s)  kubelet          Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

#events
6m1s        Normal    Pulled                    pod/elasticsearch-master-0                     Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
6m1s        Normal    Pulled                    pod/filebeat-filebeat-67qm2                    Container image "docker.elastic.co/beats/filebeat:7.9.3" already present on machine
5m59s       Normal    Started                   pod/elasticsearch-master-0                     Started container configure-sysctl
5m59s       Normal    Created                   pod/elasticsearch-master-0                     Created container configure-sysctl
5m59s       Normal    Created                   pod/filebeat-filebeat-67qm2                    Created container filebeat
5m58s       Normal    Started                   pod/filebeat-filebeat-67qm2                    Started container filebeat
5m56s       Normal    Created                   pod/elasticsearch-master-0                     Created container elasticsearch
5m56s       Normal    Pulled                    pod/elasticsearch-master-0                     Container image "docker.elastic.co/elasticsearch/elasticsearch:7.9.3" already present on machine
5m55s       Normal    Started                   pod/elasticsearch-master-0                     Started container elasticsearch
61s         Warning   Unhealthy                 pod/filebeat-filebeat-67qm2                    Readiness probe failed: elasticsearch: http://elasticsearch-master:9200...
  parse url... OK
  connection...
    parse host... OK
    dns lookup... OK
    addresses: 10.97.133.135
    dial up... ERROR dial tcp 10.97.133.135:9200: connect: connection refused
59s         Warning   Unhealthy                 pod/elasticsearch-master-0                     Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

The /mnt/data path is owned by 1000:1000 (chown 1000:1000).

Also, with only Elasticsearch installed (no Filebeat), rebooting causes no problem.

I can't figure this out at all. :(

What am I missing?

pv.yaml

kind: PersistentVolume
apiVersion: v1
metadata:
  name: elastic-pv
  labels:
    type: local
    app: elastic
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  claimRef: 
    namespace: default
    name: elasticsearch-master-elasticsearch-master-0
  hostPath:
    path: "/mnt/data"

values.yaml

---
clusterName: "elasticsearch"
nodeGroup: "master"

# The service that non master groups will try to connect to when joining the cluster
# This should be set to clusterName + "-" + nodeGroup for your master group
masterService: ""

# Elasticsearch roles that will be applied to this nodeGroup
# These will be set as environment variables. E.g. node.master=true
roles:
  master: "true"
  ingest: "true"
  data: "true"

replicas: 1
minimumMasterNodes: 1

esMajorVersion: ""

# Allows you to add any config files in /usr/share/elasticsearch/config/
# such as elasticsearch.yml and log4j2.properties
esConfig: {}
#  elasticsearch.yml: |
#    key:
#      nestedkey: value
#  log4j2.properties: |
#    key = value

# Extra environment variables to append to this nodeGroup
# This will be appended to the current 'env:' key. You can use any of the kubernetes env
# syntax here
extraEnvs: []
#  - name: MY_ENVIRONMENT_VAR
#    value: the_value_goes_here

# Allows you to load environment variables from kubernetes secret or config map
envFrom: []
# - secretRef:
#     name: env-secret
# - configMapRef:
#     name: config-map

# A list of secrets and their paths to mount inside the pod
# This is useful for mounting certificates for security and for mounting
# the X-Pack license
secretMounts: []
#  - name: elastic-certificates
#    secretName: elastic-certificates
#    path: /usr/share/elasticsearch/config/certs
#    defaultMode: 0755

image: "docker.elastic.co/elasticsearch/elasticsearch"
imageTag: "7.9.3"
imagePullPolicy: "IfNotPresent"

podAnnotations: {}
  # iam.amazonaws.com/role: es-cluster

# additionals labels
labels: {}
esJavaOpts: "-Xmx1g -Xms1g"

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

initResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

sidecarResources: {}
  # limits:
  #   cpu: "25m"
  #   # memory: "128Mi"
  # requests:
  #   cpu: "25m"
  #   memory: "128Mi"

networkHost: "0.0.0.0"

volumeClaimTemplate:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: local-storage
  resources:
    requests:
      storage: 5Gi

rbac:
  create: false
  serviceAccountAnnotations: {}
  serviceAccountName: ""

podSecurityPolicy:
  create: false
  name: ""
  spec:
    privileged: true
    fsGroup:
      rule: RunAsAny
    runAsUser:
      rule: RunAsAny
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    volumes:
      - secret
      - configMap
      - persistentVolumeClaim

persistence:
  enabled: true
  name: elastic-vc
  labels:
    # Add default labels for the volumeClaimTemplate of the StatefulSet
    app: elastic
  annotations: {}

extraVolumes: []
  # - name: extras
  #   emptyDir: {}

extraVolumeMounts: []
  # - name: extras
  #   mountPath: /usr/share/extras
  #   readOnly: true

extraContainers: []
  # - name: do-something
  #   image: busybox
  #   command: ['do', 'something']

extraInitContainers: []
  # - name: do-something
  #   image: busybox
  #   command: ['do', 'something']

# This is the PriorityClass settings as defined in
# https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
priorityClassName: ""

# By default this will make sure two pods don't end up on the same node
# Changing this to a region would allow you to spread pods across regions
antiAffinityTopologyKey: "kubernetes.io/hostname"

# Hard means that by default pods will only be scheduled if there are enough nodes for them
# and that they will never end up on the same node. Setting this to soft will do this "best effort"
antiAffinity: "hard"

# This is the node affinity settings as defined in
# https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#node-affinity-beta-feature
nodeAffinity: {}

# The default is to deploy all pods serially. By setting this to parallel all pods are started at
# the same time when bootstrapping the cluster
podManagementPolicy: "Parallel"

# The environment variables injected by service links are not used, but can lead to slow Elasticsearch boot times when
# there are many services in the current namespace.
# If you experience slow pod startups you probably want to set this to `false`.
enableServiceLinks: true

protocol: http
httpPort: 9200
transportPort: 9300

service:
  labels: {}
  labelsHeadless: {}
  type: ClusterIP
  nodePort: ""
  annotations: {}
  httpPortName: http
  transportPortName: transport
  loadBalancerIP: ""
  loadBalancerSourceRanges: []
  externalTrafficPolicy: ""

updateStrategy: RollingUpdate

# This is the max unavailable setting for the pod disruption budget
# The default value of 1 will make sure that kubernetes won't allow more than 1
# of your pods to be unavailable during maintenance
maxUnavailable: 1

podSecurityContext:
  fsGroup: 1000
  runAsUser: 1000

securityContext:
  capabilities:
    drop:
    - ALL
  #readOnlyRootFilesystem: false
  runAsNonRoot: true
  runAsUser: 1000

# How long to wait for elasticsearch to stop gracefully
terminationGracePeriod: 120

sysctlVmMaxMapCount: 262144

readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5

# https://www.elastic.co/guide/en/elasticsearch/reference/7.9/cluster-health.html#request-params wait_for_status
clusterHealthCheckParams: "wait_for_status=green&timeout=1s"

## Use an alternate scheduler.
## ref: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
##
schedulerName: ""

imagePullSecrets: []
nodeSelector: {}
tolerations: []
  # - effect: NoSchedule
  #   key: node-role.kubernetes.io/master

# Enabling this will publicly expose your Elasticsearch instance.
# Only enable this if you have security enabled on your cluster
ingress:
  enabled: false
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  path: /
  hosts:
    - chart-example.local
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nameOverride: ""
fullnameOverride: ""

# https://github.com/elastic/helm-charts/issues/63
masterTerminationFix: false

lifecycle: {}
  # preStop:
  #   exec:
  #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
  # postStart:
  #   exec:
  #     command:
  #       - bash
  #       - -c
  #       - |
  #         #!/bin/bash
  #         # Add a template to adjust number of shards/replicas
  #         TEMPLATE_NAME=my_template
  #         INDEX_PATTERN="logstash-*"
  #         SHARD_COUNT=8
  #         REPLICA_COUNT=1
  #         ES_URL=http://localhost:9200
  #         while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done
  #         curl -XPUT "$ES_URL/_template/$TEMPLATE_NAME" -H 'Content-Type: application/json' -d'{"index_patterns":['\""$INDEX_PATTERN"\"'],"settings":{"number_of_shards":'$SHARD_COUNT',"number_of_replicas":'$REPLICA_COUNT'}}'

sysctlInitContainer:
  enabled: true

keystore: []

# Deprecated
# please use the above podSecurityContext.fsGroup instead
fsGroup: ""

Could you check if there is anything in filebeat and elasticsearch pods with kubectl logs? Additionally could you please add output from kubectl describe of your filebeat pod? Could you check if it's gonna work if you change the reclaim policy from persistentVolumeReclaimPolicy: Retain to persistentVolumeReclaimPolicy: Recycle? — Jakub
@Jakub Hi, thanks for your reply. Setting Recycle gives the same result. :( I've attached the output of kubectl describe, kubectl logs, and the events, all taken with the 'Retain' PV. Files with the ready0 suffix were captured at READY 0/1, STATUS Running after rebooting; the others were captured while everything was working (READY 1/1, STATUS Running). drive.google.com/file/d/1nvopi66fXHBh3HMjokyarh-EsveK9pK2/… — Klaud Yu
In the filebeat logs there is an issue with a flannel CNI, networkPlugin cni failed to set up pod xxx network: open /run/flannel/subset.env: no such file or directory. Could you tell me if your flannel pod is up and running? Additionally could you please check if there is anything in the kubelet logs with journalctl -u kubelet? Second thing is the readiness probe, there is a workaround for that on this github issue. Could you try it and check if it's gonna work? — Jakub
The networkPlugin errors you mentioned come from the reboot itself. Following your GitHub issue link, I tried 'clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"' and it works!! This symptom may occur when replicas: 1 and minimumMasterNodes: 1. Thank you very much @Jakub :) — Klaud Yu
Happy to help. I have posted an answer with this information. If this answer or any other one solved your issue, please mark it as accepted or upvote it as per Stack Overflow rules. — Jakub

Jakub Jakub · Accepted Answer · 2020-10-28T09:16:31

Issue

There is an issue with the Elasticsearch readiness probe when running a single-replica cluster.

Warning  Unhealthy               91s (x14 over 3m42s)  kubelet          Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )
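
The probe output suggests that the chart queries the cluster health API with the configured request params. A minimal sketch for sending the same request by hand is below. The helper name health_url and the ES_URL default are assumptions for illustration; the host is taken from the service name shown in the filebeat probe output.

```shell
# Assumed default: the service name from the filebeat readiness probe output.
ES_URL="${ES_URL:-http://elasticsearch-master:9200}"

# Hypothetical helper: builds the cluster health URL for a query string
# such as the value of clusterHealthCheckParams.
health_url() {
  printf '%s/_cluster/health?%s\n' "$ES_URL" "$1"
}

# Usage (needs network access to the service, e.g. from inside a pod):
#   curl -s "$(health_url 'wait_for_status=green&timeout=1s')"
#   curl -s "$ES_URL/_cat/shards?v"   # lists every shard with its state
```

If the health response reports yellow and the shard listing shows an UNASSIGNED replica, the probe is waiting for a state that a one-node cluster cannot reach.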

Solution

As mentioned here by @adinhodovic

If you're running a single replica cluster add the following helm value:

clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"

Your status will never go green with a single replica cluster.

The following values should work:

replicas: 1
minimumMasterNodes: 1
clusterHealthCheckParams: 'wait_for_status=yellow&timeout=1s'
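
Why yellow is enough: Elasticsearch never places a replica shard on the same node as its primary, so an index that asks for a replica (presumably the filebeat index here, given the RED to YELLOW log line in the question) keeps a one-node cluster at yellow. wait_for_status succeeds once the cluster reaches the requested status or a better one. The sketch below only illustrates that comparison; the function names are hypothetical and this is not the chart's actual probe script.

```shell
# Maps a cluster health status to a rank; unknown values rank lowest.
status_rank() {
  case "$1" in
    green)  printf '%s\n' 2 ;;
    yellow) printf '%s\n' 1 ;;
    red)    printf '%s\n' 0 ;;
    *)      printf '%s\n' -1 ;;
  esac
}

# Succeeds when the actual status ($2) is at least as good as the
# wanted status ($1). An unknown wanted status never succeeds.
status_satisfies() {
  [ "$(status_rank "$1")" -ge 0 ] &&
    [ "$(status_rank "$2")" -ge "$(status_rank "$1")" ]
}

# Applying the changed values (Helm 2 syntax, as in the question), then checking:
#   helm upgrade -f values.yaml elasticsearch elastic/elasticsearch
#   kubectl get pods
```

With wait_for_status=green, a yellow cluster fails the check on every probe; with wait_for_status=yellow, both yellow and green pass.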
