1
votes

I have an Elasticsearch (0.90.3) cluster with 4 nodes and 5 shards. After a restart, 4 of the 5 shards are unassigned and the cluster status is red, so I assume the restart was not done correctly. Each node was sent a kill (SIGKILL) at 30-second intervals: one node was killed, 30 seconds later another of the remaining 3 was killed, and so on.

I tried this solution to get the shards reassigned, but nothing worked until I manually assigned a primary shard to the cluster using this approach. However, manually assigning a primary shard resets the shard's data, resulting in data loss.

How do I avoid running into the unassigned shard problem? And if I am stuck with it, how can I recover without data loss?

Instead of calling a kill command on the process, I would typically shut the node down. That is less likely to upset the system, as it follows a shutdown procedure: elasticsearch.org/guide/en/elasticsearch/reference/current/… - Nathan Smith
@Nate Thanks, that is what I am planning to do. Can you also shed some light on why the unassigned shards problem happens? I honestly do not know what caused it. SIGKILL might have caused it, but I cannot see what followed the SIGKILL to produce this. - Prasanna
If you're trying to avoid downtime, shouldn't you be restarting each node before shutting down the next one? - Avish
@Avish You are right, but unfortunately that is not what was done here. The lesson learnt is to update one node at a time. I am still curious why Elasticsearch was unable to form the cluster again on startup. - Prasanna

1 Answer

2
votes

The correct way to restart a cluster is to do a rolling restart using the shutdown API.

This works by:

  1. Disabling shard allocation
  2. Restarting one node (cluster goes yellow)
  3. Waiting until it rejoins the cluster
  4. Re-enabling shard allocation
  5. Waiting until shards are reallocated (cluster goes green)
  6. Repeating on the other nodes.
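
As a rough sketch, the steps above map onto REST calls like the ones below. This is not a definitive script: it assumes the 0.90.x-era setting name `cluster.routing.allocation.disable_allocation` and the nodes shutdown endpoint of that era (later versions renamed the setting and dropped the endpoint), and the URL, node name and timeouts are placeholders made up for illustration. Check the calls against the documentation for your version before running them.

```python
import json
import urllib.request

# Assumption: some node of the cluster is reachable on the default HTTP port.
ES_URL = "http://localhost:9200"


def allocation_body(disable):
    # Assumed 0.90.x setting name; 1.x and later use
    # "cluster.routing.allocation.enable" instead.
    return {"transient": {"cluster.routing.allocation.disable_allocation": disable}}


def restart_steps(node_id, total_nodes):
    """Return the ordered (method, path, body) calls for restarting one node."""
    return [
        # Step 1: disable shard allocation.
        ("PUT", "/_cluster/settings", allocation_body(True)),
        # Step 2: shut the node down cleanly, then start its process again
        # yourself (starting the process is outside this sketch).
        ("POST", "/_cluster/nodes/%s/_shutdown" % node_id, None),
        # Step 3: once the node has been started again, wait for it to rejoin.
        ("GET", "/_cluster/health?wait_for_nodes=%d&timeout=10m" % total_nodes, None),
        # Step 4: re-enable shard allocation.
        ("PUT", "/_cluster/settings", allocation_body(False)),
        # Step 5: wait until the cluster is green again.
        ("GET", "/_cluster/health?wait_for_status=green&timeout=30m", None),
    ]


def send(method, path, body=None, base_url=ES_URL):
    """Send one call to the cluster and return the parsed JSON response."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    req = urllib.request.Request(base_url + path, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Dry run only: print the plan for a hypothetical node "node1" in a 4-node cluster.
    for method, path, body in restart_steps("node1", 4):
        print(method, path, json.dumps(body) if body else "")
```

To actually run it you would pass each tuple to `send(...)` in order, pausing after the shutdown call to start the node again, and then repeat for the next node (step 6).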

You may want to increase indices.recovery.max_bytes_per_sec and cluster.routing.allocation.node_concurrent_recoveries to speed up step 5. Whilst the cluster is yellow, some shards will be unassigned (because they were on the node that was restarted), but this is not a problem: reads and writes will still work as normal.
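
For the two recovery settings, a minimal sketch of the request body might look like this, assuming both can be changed at runtime through the cluster settings endpoint (`PUT /_cluster/settings`) on your version. The values `100mb` and `4` are arbitrary illustrations, not recommendations; pick numbers that suit your disks and network.

```python
import json


def recovery_settings_body(max_bytes_per_sec="100mb", concurrent_recoveries=4):
    """Build a transient cluster-settings body that raises the recovery limits.

    The default values here are illustrative assumptions only.
    """
    return {
        "transient": {
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
        }
    }


if __name__ == "__main__":
    # Print the JSON you would send as the body of PUT /_cluster/settings.
    print(json.dumps(recovery_settings_body(), indent=2))
```

Because the settings are sent as `transient`, they do not survive a full cluster restart, which is usually what you want for a temporary speed-up.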