
I'm trying to use Sentinel for failover in a large Redis fleet (12 sentinels, 500+ shards of one master and one slave each). I'm running into a very strange issue where my sentinels repeatedly log +fix-slave-config events against certain Redis nodes. I did not notice this happening at smaller scale, for what it is worth.

I've noticed two specific issues:

  1. +fix-slave-config messages, as stated above
  2. The sentinel.conf on every sentinel ends up listing certain slaves under two masters (each slave should belong to exactly one)

The fleet in its starting state has a certain slave node XXX.XXX.XXX.177 whose master is XXX.XXX.XXX.244 (together they comprise shard 172 in the fleet). Without any node outages, the slave's master is switched to XXX.XXX.XXX.96 (the master of shard 188), then back, then forth again. This is verified by sshing into the slave and master nodes and checking redis-cli info. All Redis nodes started in the correct configuration. All Sentinel nodes had the correct configuration in their sentinel.conf. Each Sentinel reports the exact same list of masters when I query them after each of these slave->master changes.
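
A minimal way to watch this flapping from the shell (a sketch, assuming the default port 6379, no requirepass, and the placeholder addresses above):

# ask the slave which master it currently reports
redis-cli -h XXX.XXX.XXX.177 -p 6379 info replication | grep -E 'role:|master_host:'
# repeat every minute to catch the master flipping between .244 and .96
watch -n 60 "redis-cli -h XXX.XXX.XXX.177 -p 6379 info replication | grep master_host:"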

Across my 12 sentinels, the following is logged; roughly every minute another +fix-slave-config event is emitted:

Sentinel #8: 20096:X 22 Oct 01:41:49.793 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #1: 9832:X 22 Oct 01:42:50.795 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
Sentinel #6: 20528:X 22 Oct 01:43:52.458 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #10: 20650:X 22 Oct 01:43:52.464 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #2: 20838:X 22 Oct 01:44:53.489 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
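
Rather than tailing twelve log files, the same events can be watched live over Pub/Sub, since Sentinel publishes each event on a channel named after the event type. A sketch, assuming the sentinels listen on the default port 26379 with no auth (<sentinel-host> is a placeholder for any one sentinel):

# connect to one sentinel and watch reconfiguration events as they fire
redis-cli -h <sentinel-host> -p 26379 psubscribe '+fix-slave-config' '+slave'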

Here's the output of the SENTINEL MASTERS command. The strange thing is that shard-188 has two slaves, when in fact it should only have one. The output looks the same whether XXX.XXX.XXX.177 is listed under shard-172 or shard-188.

Case 1) XXX.XXX.XXX.244 is master of XXX.XXX.XXX.177

183)  1) "name"
      2) "shard-172"
      3) "ip"
      4) "XXX.XXX.XXX.244"
      5) "port"
      6) "6379"
      7) "runid"
      8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "14"
     17) "last-ping-reply"
     18) "14"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "5636"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17154406"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "1"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"
72)  1) "name"
      2) "shard-188"
      3) "ip"
      4) "XXX.XXX.XXX.96"
      5) "port"
      6) "6379"
      7) "runid"
      8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "927"
     17) "last-ping-reply"
     18) "927"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "5333"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17154312"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "2"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"

Case 2) XXX.XXX.XXX.96 is master of XXX.XXX.XXX.177

79)  1) "name"
      2) "shard-172"
      3) "ip"
      4) "XXX.XXX.XXX.244"
      5) "port"
      6) "6379"
      7) "runid"
      8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "1012"
     17) "last-ping-reply"
     18) "1012"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "1261"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17059720"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "1"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"
273)  1) "name"
      2) "shard-188"
      3) "ip"
      4) "XXX.XXX.XXX.96"
      5) "port"
      6) "6379"
      7) "runid"
      8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "886"
     17) "last-ping-reply"
     18) "886"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "5762"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17059758"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "2"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"
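
Each sentinel can also be asked directly which slaves it currently attributes to a pod, which is a narrower view than SENTINEL MASTERS. A quick fleet-wide check, assuming the default Sentinel port 26379 and a hypothetical sentinels.txt listing the twelve sentinel hosts, one per line:

while read s; do
  echo "== $s =="
  # every slave this sentinel believes belongs to shard-188
  redis-cli -h "$s" -p 26379 sentinel slaves shard-188
done < sentinels.txt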

My starting sentinel.conf for each sentinel is:

maxclients 20000
loglevel notice
logfile "/home/redis/logs/sentinel.log"
sentinel monitor shard-172 redis-b-172  7
sentinel down-after-milliseconds shard-172 30000
sentinel failover-timeout shard-172 60000
sentinel parallel-syncs shard-172 1
....
sentinel monitor shard-188 redis-b-188  7
sentinel down-after-milliseconds shard-188 30000
sentinel failover-timeout shard-188 60000
sentinel parallel-syncs shard-188 1

Here's the resulting sentinel.conf (for all sentinels) after a few minutes. Note the two slaves under shard-188:

sentinel monitor shard-172 XXX.XXX.XXX.244 6379 7
sentinel failover-timeout shard-172 60000
sentinel config-epoch shard-172 0
sentinel leader-epoch shard-172 0
sentinel known-slave shard-172 XXX.XXX.XXX.177 6379 <--- True slave of shard-172
sentinel known-sentinel shard-172 ...
...
sentinel monitor shard-188 XXX.XXX.XXX.96 6379 7
sentinel failover-timeout shard-188 60000
sentinel config-epoch shard-188 0
sentinel leader-epoch shard-188 0
sentinel known-slave shard-188 XXX.XXX.XXX.194 6379 <--- True slave of shard-188
sentinel known-slave shard-188 XXX.XXX.XXX.177 6379
sentinel known-sentinel shard-188 ... 
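
A quick way to see which pods have picked up the stray slave in the rewritten config files (a sketch; substitute the real sentinel.conf path on your hosts):

# the flapping slave should appear under exactly one shard; two hits means the pods are intermingled
grep 'known-slave.*XXX.XXX.XXX.177' /path/to/sentinel.conf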

1 Answer


This is what I call an "ant problem": you have two (or more) pods (master + slave pairs) whose membership has become intermingled. You show this yourself in the output where one of your pods has an extra slave.

Specifically:

Here's the output of the SENTINEL MASTERS command. The strange thing is that shard-188 has two slaves, when in fact it should only have one.

What you need to do is the following (a command sketch follows the list):

  1. Remove those pods (shard-188 and the other affected shard, shard-172 in your case) from all sentinels via sentinel remove shard-NNN
  2. Bring the pods that those slaves belong to down
  3. Configure them correctly (proper slaveof commands/configuration)
  4. Bring them back online
  5. Ensure they each have only the one correct slave
  6. Add them back into the Sentinels via sentinel monitor ...
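
A rough sketch of steps 1, 3 and 6 with redis-cli, assuming default ports, no auth, and the addresses from the question (<sentinel-host> stands for each of the twelve sentinels in turn):

# 1. stop Sentinel from managing the affected pods (run against every sentinel)
redis-cli -h <sentinel-host> -p 26379 sentinel remove shard-188
redis-cli -h <sentinel-host> -p 26379 sentinel remove shard-172

# 3. point the slave at its one correct master (run against XXX.XXX.XXX.177)
redis-cli -h XXX.XXX.XXX.177 -p 6379 slaveof XXX.XXX.XXX.244 6379

# 6. once info replication looks right on every node, re-add the pods on every sentinel
redis-cli -h <sentinel-host> -p 26379 sentinel monitor shard-172 XXX.XXX.XXX.244 6379 7
redis-cli -h <sentinel-host> -p 26379 sentinel monitor shard-188 XXX.XXX.XXX.96 6379 7
redis-cli -h <sentinel-host> -p 26379 sentinel set shard-172 down-after-milliseconds 30000
redis-cli -h <sentinel-host> -p 26379 sentinel set shard-188 down-after-milliseconds 30000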

Now, technically you could use the sentinel reset command, but you'd be facing potential timing issues, so removing the pods from Sentinel is the sure-fire route. Optionally, you can leave the master/slave up and simply reconfigure the slave appropriately. If you go that route, wait a few minutes and check the slave listing before moving on to step 6.
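
If you do try the sentinel reset route despite the timing caveat, the idea is to flush the cached slave/sentinel state for the affected masters on one sentinel at a time, waiting between hosts so each can rediscover state cleanly. A hedged sketch, reusing the hypothetical sentinels.txt host list and default port 26379 from above:

while read s; do
  redis-cli -h "$s" -p 26379 sentinel reset shard-172
  redis-cli -h "$s" -p 26379 sentinel reset shard-188
  # give this sentinel time to re-learn slaves and sentinels before resetting the next one
  sleep 60
done < sentinels.txt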