Fix elasticsearch broken cluster within kubernetes

Question

I deployed an Elasticsearch cluster with the official Helm chart (https://github.com/elastic/helm-charts/tree/master/elasticsearch).

There are 3 Helm releases:

master (3 nodes)
client (1 node)
data (2 nodes)

The cluster was running fine. As a crash test, I removed the master release and re-created it.

After that, the master nodes are OK, but the data nodes complain:

Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid xeQ6IVkDQ2es1CO2yZ_7rw than local cluster uuid 9P9ZGqSuQmy7iRDGcit5fg, rejecting

which is expected, because the master nodes are new and formed a cluster with a new UUID.

How can I fix the cluster state on the data nodes without removing the data folder?

Edit:

I know why it is broken, and I know a basic solution is to remove the data folder and restart the node (the Elastic forum has a lot of similar questions without answers). But I am looking for a production-ready solution, maybe with the elasticsearch-node tool (https://www.elastic.co/guide/en/elasticsearch/reference/current/node-tool.html)?

Thomas Decaux · Accepted Answer · 2021-01-04T12:30:34

With the elasticsearch-node utility it is possible to reset a node's cluster state, so that the node can then join another cluster.

The tricky part is running this utility with Docker, because the Elasticsearch server must be stopped while it runs!

Solution with Kubernetes:

Stop the pods by scaling the StatefulSet to 0: kubectl scale sts data-nodes --replicas=0
Create a k8s Job that resets the cluster state, with the data volume attached
Apply the Job once for each PVC
Scale the StatefulSet back up and enjoy!
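
The scale-down and scale-up steps can be scripted. This is only a minimal sketch, not a tested production script: the StatefulSet name data-nodes comes from the command above, the replica count of 2 matches the data release in the question, and the KUBECTL, STS and REPLICAS variables and the function names are made up here for illustration.

```shell
#!/bin/sh
# Sketch: stop the data nodes before the repair and start them again afterwards.
# KUBECTL, STS and REPLICAS are illustrative variables; adjust them to your release.
KUBECTL="${KUBECTL:-kubectl}"
STS="${STS:-data-nodes}"
REPLICAS="${REPLICAS:-2}"

stop_nodes() {
  # elasticsearch-node must not run while Elasticsearch is using the data path.
  $KUBECTL scale statefulset "$STS" --replicas=0
}

wait_stopped() {
  # StatefulSet pods are named <statefulset>-<ordinal>.
  # Depending on the kubectl version, this may report an error
  # if a pod is already gone when the command starts.
  i=0
  while [ "$i" -lt "$REPLICAS" ]; do
    $KUBECTL wait --for=delete "pod/$STS-$i" --timeout=120s
    i=$((i + 1))
  done
}

start_nodes() {
  # Bring the nodes back once every repair Job has completed.
  $KUBECTL scale statefulset "$STS" --replicas="$REPLICAS"
}
```

Call stop_nodes and wait_stopped first, run the repair Job once per PVC, then call start_nodes.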

job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  # placeholder: use one concrete ordinal per Job (a literal bracket range is not a valid name)
  name: test-fix-cluster-m[0-3]
spec:
  template:
    spec:
      containers:
      - args:
        - -c
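        - yes | elasticsearch-node detach-cluster; yes | elasticsearch-node remove-customs '*'
        # unsafe-bootstrap is needed for at least 1 PVC.
        # Note: uncommented as a separate list item, the line below would be passed to
        # `sh -c` as $0 and not executed; make it part of the command string instead.
        #- yes | elasticsearch-node unsafe-bootstrap -v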
        command:
        - /bin/sh
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
        name: elasticsearch
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
      restartPolicy: Never
      volumes:
      - name: es-data
        persistentVolumeClaim:
          # placeholder: the PVC of exactly one node, selected by its ordinal
          claimName: es-test-master-es-test-master-[0-3]
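
Since the manifest has to be applied once per PVC, the ordinal placeholder can be substituted with sed. This is a rough sketch, assuming the manifest is saved as job.yaml and still contains the literal "[0-3]" placeholder in both the Job name and the claimName; render_job and apply_jobs are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: turn the job.yaml template into one concrete Job per PVC.
# Assumes the template still contains the literal "[0-3]" placeholder.
KUBECTL="${KUBECTL:-kubectl}"

render_job() {
  # $1 = template path, $2 = ordinal of the PVC to repair
  # "\[" matches a literal bracket; every placeholder becomes the ordinal.
  sed "s/\[0-3]/$2/g" "$1"
}

apply_jobs() {
  # $1 = template path, remaining arguments = ordinals, e.g. apply_jobs job.yaml 0 1
  template="$1"
  shift
  for ordinal in "$@"; do
    render_job "$template" "$ordinal" | $KUBECTL apply -f -
  done
}
```

For example, render_job job.yaml 0 prints the manifest for the first PVC so it can be reviewed before apply_jobs job.yaml 0 1 2 submits anything. Completion can then be awaited with kubectl wait --for=condition=complete job/<job-name>.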

If you are interested, here is the code behind unsafe-bootstrap: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterCommand.java#L83

I have written up the full story at https://medium.com/@thomasdecaux/fix-broken-elasticsearch-cluster-405ad67ee17c.
