We encountered a production incident in which the Elasticsearch cluster health check returned a red
status. The health check report shows that the index .marvel-2019.06.20
has 2 unassigned shards, which seems to be the root cause.
curl -XGET 'localhost:9200/_cluster/health?level=indices&pretty'
{
"cluster_name" : "sap-jam-jam8",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 122,
"active_shards" : 239,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 7,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"indices" : {
...
...
".marvel-2019.06.20" : {
"status" : "red",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2
}
}
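To pick the problem indices out of a long report like the one above, a small script can filter on `unassigned_shards`. This is only a minimal sketch: the function name `indices_with_unassigned_shards` is made up for illustration, and the sample is a trimmed copy of the report with the elided indices left out.

```python
import json


def indices_with_unassigned_shards(health):
    """Return {index_name: unassigned_count} for indices with unassigned shards."""
    return {
        name: info["unassigned_shards"]
        for name, info in health.get("indices", {}).items()
        if info.get("unassigned_shards", 0) > 0
    }


# Trimmed sample shaped like the health report above (other indices omitted).
sample = json.loads("""
{
  "cluster_name": "sap-jam-jam8",
  "status": "red",
  "unassigned_shards": 7,
  "indices": {
    ".marvel-2019.06.20": {
      "status": "red",
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "active_primary_shards": 0,
      "active_shards": 0,
      "unassigned_shards": 2
    }
  }
}
""")

print(indices_with_unassigned_shards(sample))
```

In practice the input would be the full JSON body returned by the `_cluster/health?level=indices` call rather than a hard-coded sample.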
We checked the Elasticsearch cluster settings and found that cluster.routing.allocation.enable
had been set to none, i.e. shard allocation was disabled.
curl -XGET 'localhost:9200/_cluster/settings?pretty'
{
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"enable" : "none"
}
}
}
}
}
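For reference, the same settings endpoint accepts a PUT that changes this value. We did not do this during the incident, so the following is only a sketch of what such a request might look like. It assumes the cluster is reachable at `http://localhost:9200`; the helper name `build_allocation_request` is hypothetical, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Values accepted by cluster.routing.allocation.enable.
ALLOWED_VALUES = {"all", "primaries", "new_primaries", "none"}


def build_allocation_request(value="all", host="http://localhost:9200"):
    """Build (but do not send) a PUT /_cluster/settings request that sets the
    transient cluster.routing.allocation.enable setting."""
    if value not in ALLOWED_VALUES:
        raise ValueError(f"unsupported value: {value!r}")
    body = {"transient": {"cluster.routing.allocation.enable": value}}
    return urllib.request.Request(
        url=f"{host}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_allocation_request("all")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending is left out on purpose: urllib.request.urlopen(request) would apply
# the change to a live cluster.
```

Whether allocation should be switched back on depends on why it was disabled in the first place (for example, a rolling restart that was never finished), which we have not established.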
As this Stack Overflow post suggested, we forced the shard to be assigned, and the issue went away.
curl -XPOST -d '{ "commands" : [ {
"allocate" : {
"index" : ".marvel-2014.05.21",
"shard" : 0,
"node" : "SOME_NODE_HERE",
"allow_primary":true
}
} ] }' http://localhost:9200/_cluster/reroute?pretty
After resolving this incident, I think it is necessary to understand the basic concepts of shard allocation.
I did some research, but the following questions still confuse me.
1. Why does Elasticsearch need to assign shards to other nodes?
In my case, we have two Elasticsearch nodes, A and B. Two shards have already been created on A and consume disk space there.
When B is not available, why not just activate those two shards on server A?
At least that would return a yellow health status.
2. What is the procedure for assigning a shard?
As in the first question, suppose both the primary shard and its replica have been created on server A. What does "assign a shard to B" mean?
Does it mean copying the shard from server A to server B?
3. How can the zero active shards be explained?
Both the primary shard and the replica have been created, but neither is active. How is that possible? Besides disk storage, is there other overhead to activating a shard, e.g. memory?
".marvel-2019.06.20" : {
"status" : "red",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 0,
"active_shards" : 0, // both shards are inactive.
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2
}
4. Is the following assumption true?
To make a shard active, Elasticsearch needs to do the following steps:
- Create the shard.
- Find a server that has enough disk space and RAM to run it.
- Copy the shard from the source server to the destination server.
- Activate the shard.
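For the red and yellow statuses mentioned in questions 1 and 3, my current understanding is that the colour follows from the shard counts in the report: red when at least one primary is not active, yellow when all primaries are active but some replica is not, and green otherwise. A minimal sketch of that rule, with a made-up function name `index_health` and parameters named after the fields in the report:

```python
def index_health(number_of_shards, number_of_replicas,
                 active_primary_shards, active_shards):
    """Derive an index's health colour from its shard counts."""
    # Every primary shard has number_of_replicas copies in addition to itself.
    expected_total = number_of_shards * (1 + number_of_replicas)
    if active_primary_shards < number_of_shards:
        return "red"      # at least one primary is not active
    if active_shards < expected_total:
        return "yellow"   # all primaries active, some replica is not
    return "green"        # every primary and replica is active


# Counts taken from the .marvel-2019.06.20 entry: no active primary.
print(index_health(1, 1, 0, 0))
# Hypothetical case: the primary is active but its replica is not.
print(index_health(1, 1, 1, 1))
# Hypothetical case: both copies are active.
print(index_health(1, 1, 1, 2))
```

Under this rule the index would only have reached yellow if its primary had been assigned, which is exactly what did not happen here.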