I'm trying to use Sentinel for failover in a large Redis fleet (12 sentinels, 500+ shards of one master and one slave each). I'm hitting a very strange issue where my sentinels repeatedly issue +fix-slave-config to certain Redis nodes. For what it's worth, I did not notice this happening at smaller scale.
I've noticed two specific issues:
- +fix-slave-config messages, as stated above
- The sentinel.conf shows certain slaves listed under two masters (each slave should have exactly one)
In its starting state, the fleet has a slave node XXX.XXX.XXX.177 whose master is XXX.XXX.XXX.244 (together they comprise shard 172). Without any node outages, the slave's master is switched to XXX.XXX.XXX.96 (the master of shard 188), then back, then forth again. I verified this by sshing into the slave and master nodes and checking redis-cli INFO. All Redis nodes started in the correct configuration, all Sentinel nodes had the correct configuration in their sentinel.conf, and each Sentinel reports the exact same list of masters when I query it after each of these slave->master changes.
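The per-node check boils down to reading the master_host field out of INFO replication. A minimal sketch of that check (get_master_host is my own helper, not a Redis command; a captured sample stands in for a live `redis-cli -h <node> INFO replication` call):

```shell
# Hypothetical helper: pull the reported master out of INFO replication
# output. Against a live node this would be fed from
# `redis-cli -h <node> INFO replication`.
get_master_host() {
  grep '^master_host:' | tr -d '\r' | cut -d: -f2
}

# Sample INFO replication output from the slave, for illustration:
sample='role:slave
master_host:XXX.XXX.XXX.244
master_port:6379'

master=$(printf '%s\n' "$sample" | get_master_host)
printf '%s\n' "$master"
```

Running this against .177 at different times is how I observed the master flipping between .244 and .96.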
Across my 12 sentinels, the following is logged; a +fix-slave-config message is sent every minute:
Sentinel #8: 20096:X 22 Oct 01:41:49.793 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #1: 9832:X 22 Oct 01:42:50.795 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
Sentinel #6: 20528:X 22 Oct 01:43:52.458 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #10: 20650:X 22 Oct 01:43:52.464 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #2: 20838:X 22 Oct 01:44:53.489 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
Here's the output of the SENTINEL MASTERS command. The strange thing is that shard-188 shows two slaves when in fact it should only have one. The output looks the same whether XXX.XXX.XXX.177 is listed under shard-172 or shard-188.
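To spot the inconsistency without scanning 500+ array replies by hand, I summarize the num-slaves field per shard. A sketch of that filter (the here-string below stands in for the live `redis-cli -p 26379 SENTINEL MASTERS` reply; the awk logic is my own, not a Sentinel feature):

```shell
# Sample fragment of SENTINEL MASTERS output (abridged to the relevant fields):
sample='183) 1) "name"
      2) "shard-172"
     29) "num-slaves"
     30) "1"
 72) 1) "name"
      2) "shard-188"
     29) "num-slaves"
     30) "2"'

# Remember the last "name" value seen, print it next to each num-slaves value.
summary=$(printf '%s\n' "$sample" | awk '
  /"name"/       { getline; n = $NF; gsub(/"/, "", n); shard = n }
  /"num-slaves"/ { getline; c = $NF; gsub(/"/, "", c); print shard, c }')
printf '%s\n' "$summary"
```

For shard-188 this prints a slave count of 2 on every sentinel, matching the duplicated known-slave entry below.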
Case 1) XXX.XXX.XXX.244 is master of XXX.XXX.XXX.177
183) 1) "name"
2) "shard-172"
3) "ip"
4) "XXX.XXX.XXX.244"
5) "port"
6) "6379"
7) "runid"
8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "14"
17) "last-ping-reply"
18) "14"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "5636"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17154406"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "1"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"
72) 1) "name"
2) "shard-188"
3) "ip"
4) "XXX.XXX.XXX.96"
5) "port"
6) "6379"
7) "runid"
8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "927"
17) "last-ping-reply"
18) "927"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "5333"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17154312"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "2"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"
Case 2) XXX.XXX.XXX.96 is master of XXX.XXX.XXX.177
79) 1) "name"
2) "shard-172"
3) "ip"
4) "XXX.XXX.XXX.244"
5) "port"
6) "6379"
7) "runid"
8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "1012"
17) "last-ping-reply"
18) "1012"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "1261"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17059720"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "1"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"
273) 1) "name"
2) "shard-188"
3) "ip"
4) "XXX.XXX.XXX.96"
5) "port"
6) "6379"
7) "runid"
8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
9) "flags"
10) "master"
11) "pending-commands"
12) "0"
13) "last-ping-sent"
14) "0"
15) "last-ok-ping-reply"
16) "886"
17) "last-ping-reply"
18) "886"
19) "down-after-milliseconds"
20) "30000"
21) "info-refresh"
22) "5762"
23) "role-reported"
24) "master"
25) "role-reported-time"
26) "17059758"
27) "config-epoch"
28) "0"
29) "num-slaves"
30) "2"
31) "num-other-sentinels"
32) "12"
33) "quorum"
34) "7"
35) "failover-timeout"
36) "60000"
37) "parallel-syncs"
38) "1"
My starting sentinel.conf for each sentinel is:
maxclients 20000
loglevel notice
logfile "/home/redis/logs/sentinel.log"
sentinel monitor shard-172 redis-b-172 6379 7
sentinel down-after-milliseconds shard-172 30000
sentinel failover-timeout shard-172 60000
sentinel parallel-syncs shard-172 1
....
sentinel monitor shard-188 redis-b-188 6379 7
sentinel down-after-milliseconds shard-188 30000
sentinel failover-timeout shard-188 60000
sentinel parallel-syncs shard-188 1
Here's the resulting sentinel.conf (for all sentinels) after a few minutes. Note the two slaves under shard-188:
sentinel monitor shard-172 XXX.XXX.XXX.244 6379 7
sentinel failover-timeout shard-172 60000
sentinel config-epoch shard-172 0
sentinel leader-epoch shard-172 0
sentinel known-slave shard-172 XXX.XXX.XXX.177 6379 <--- True slave of shard-172
sentinel known-sentinel shard-172 ...
...
sentinel monitor shard-188 XXX.XXX.XXX.96 6379 7
sentinel failover-timeout shard-188 60000
sentinel config-epoch shard-188 0
sentinel leader-epoch shard-188 0
sentinel known-slave shard-188 XXX.XXX.XXX.194 6379 <--- True slave of shard-188
sentinel known-slave shard-188 XXX.XXX.XXX.177 6379
sentinel known-sentinel shard-188 ...
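The duplicate can also be confirmed straight from the rewritten config file by counting known-slave lines per shard. A sketch, assuming the file layout shown above (the here-string stands in for the real sentinel.conf):

```shell
# Sample of the rewritten sentinel.conf, reduced to its known-slave lines:
conf='sentinel known-slave shard-172 XXX.XXX.XXX.177 6379
sentinel known-slave shard-188 XXX.XXX.XXX.194 6379
sentinel known-slave shard-188 XXX.XXX.XXX.177 6379'

# Count known-slave entries per shard; any shard with more than one is suspect
# in this fleet, since every shard has exactly one slave.
counts=$(printf '%s\n' "$conf" | awk '$2 == "known-slave" { c[$3]++ }
  END { for (s in c) print s, c[s] }' | sort)
printf '%s\n' "$counts"
```

On every one of my 12 sentinels this reports shard-188 with two slaves, with XXX.XXX.XXX.177 appearing under both shard-172 and shard-188.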