1
votes

I deployed an Elasticsearch cluster with the official Helm chart (https://github.com/elastic/helm-charts/tree/master/elasticsearch).

There are 3 Helm releases:

  • master (3 nodes)
  • client (1 node)
  • data (2 nodes)

The cluster was running fine. I did a crash test by removing the master release and re-creating it.

After that, the master nodes are OK, but the data nodes complain:

Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid xeQ6IVkDQ2es1CO2yZ_7rw than local cluster uuid 9P9ZGqSuQmy7iRDGcit5fg, rejecting

which is expected, because the master nodes are new and formed a cluster with a different cluster UUID.

How can I fix the cluster state on the data nodes without removing the data folder?

Edit:

I know why it is broken, and I know that a basic solution is to remove the data folder and restart the node (on the Elastic forum there are a lot of similar questions without answers). But I am looking for a production-aware solution, maybe with the elasticsearch-node tool (https://www.elastic.co/guide/en/elasticsearch/reference/current/node-tool.html)?

2 Answers

0
votes

Using the elasticsearch-node utility, it is possible to reset a node's cluster state, so that the node can then join another cluster.

The tricky part is running this utility with Docker, because the Elasticsearch server must be stopped while it runs!

Solution with Kubernetes:

  1. Stop the pods by scaling the StatefulSet (sts) to 0: kubectl scale sts data-nodes --replicas=0
  2. Create a k8s Job that resets the cluster state, with the data volume attached
  3. Apply the Job once for each PVC
  4. Scale the sts back up and enjoy!

job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-fix-cluster-m[0-3]
spec:
  template:
    spec:
      containers:
      - args:
        - -c
        - yes | elasticsearch-node detach-cluster; yes | elasticsearch-node remove-customs '*'
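        # For at least 1 PVC, also run unsafe-bootstrap by appending it to the
        # script string above: "; yes | elasticsearch-node unsafe-bootstrap -v"
        # (as a separate list item it would only become $0 of sh -c and never run).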
        command:
        - /bin/sh
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
        name: elasticsearch
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
      restartPolicy: Never
      volumes:
      - name: es-data
        persistentVolumeClaim:
          claimName: es-test-master-es-test-master-[0-3]
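
Step 3 (one Job per PVC) can be scripted. The following is only a minimal sketch: it assumes the manifest above is saved as job.yaml with the literal placeholder [0-3] left in the Job name and the claimName, and render_job is a hypothetical helper name, not part of kubectl or Elasticsearch. The claim names are the ones from this answer; adjust them to the PVCs of the nodes you actually need to repair.

```shell
#!/bin/sh
# render_job ORDINAL
# Reads the job.yaml template on stdin and writes a concrete manifest to
# stdout, replacing every literal "[0-3]" placeholder with ORDINAL.
# (Hypothetical helper; the placeholder convention is an assumption.)
render_job() {
  sed "s/\[0-3]/$1/g"
}

# Possible usage, one Job per PVC ordinal (left commented out on purpose,
# because it talks to a live cluster):
#   for i in 0 1 2; do
#     render_job "$i" < job.yaml | kubectl apply -f -
#   done
```

Piping to kubectl apply -f - keeps the template file untouched, so the same job.yaml can be reused for every ordinal.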

If you are interested, here is the code behind unsafe-bootstrap: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterCommand.java#L83

I wrote a short write-up about it at https://medium.com/@thomasdecaux/fix-broken-elasticsearch-cluster-405ad67ee17c.

-3
votes

As is clear from the log message, the cluster state of your data nodes was not updated after you changed the master nodes.

If you can afford to lose the data (as you are just testing), then you can simply delete the data folder where ES stores the cluster state (note that this requires removing the data nodes, deleting their data folders, and adding them back). Otherwise, follow the official article on how to add or remove master-eligible nodes in an ES cluster.