2 votes

I'm trying to set up automatic failover in a 3-node Redis cluster. I installed redis-sentinel on each of these nodes (just like this guy: http://www.symantec.com/connect/blogs/configuring-redis-high-availability). Everything is fine as long as two or three nodes are up. The problem is that whenever only one node remains and it's a slave, it does not get elected as master automatically. The quorum is set to 1, so the last node detects the ODOWN of the master, but the failover cannot proceed because there is no majority of sentinels left to authorize it.
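For reference, each node runs a sentinel with roughly the following configuration (the IP is a placeholder; the quorum is 1, as described above):

    port 26379
    sentinel monitor mymaster x.x.x.x 6379 1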

To work around this (surprising) issue, I wrote a little script that asks the other nodes for their master, and if they don't answer, sets the current node as the master. This script is called from the redis-sentinel.conf file as a notification script. However, as soon as the redis-sentinel service is started, this configuration is "erased"! If I look at the configuration file in /etc, the "sentinel notification-script" line has disappeared (redis-sentinel rewrites its configuration file, so why not), BUT the other settings I wrote are no longer in effect either:

1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "x.x.x.x"
    5) "port"
    6) "6379"
    7) "runid"
    8) "somerunid"
    9) "flags"
   10) "master"
   11) "pending-commands"
   12) "0"
   13) "last-ping-sent"
   14) "0"
   15) "last-ok-ping-reply"
   16) "395"
   17) "last-ping-reply"
   18) "395"
   19) "down-after-milliseconds"
   20) "30000"
   21) "info-refresh"
   22) "674"
   23) "role-reported"
   24) "master"
   25) "role-reported-time"
   26) "171302"
   27) "config-epoch"
   28) "0"
   29) "num-slaves"
   30) "1"
   31) "num-other-sentinels"
   32) "1"
   33) "quorum"
   34) "1"
   35) "failover-timeout"
   36) "180000"
   37) "parallel-syncs"
   38) "1"

That is the output of the SENTINEL masters command. The problem is that I had previously set "down-after-milliseconds" to 5000 and "failover-timeout" to 10000, yet the output above shows the defaults (30000 and 180000).
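For reference, the lines I had added to the configuration file looked roughly like this (the script path is a placeholder):

    sentinel down-after-milliseconds mymaster 5000
    sentinel failover-timeout mymaster 10000
    sentinel notification-script mymaster /path/to/check-master.sh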

Has anyone run into anything similar? If someone has a little idea about what's happening, I'd be glad to hear it ;)

2
What did you do to overcome this problem? Can you please tell me? It would be really helpful; I am facing the same problem and haven't found anything. – Sudhanshu Gaur

2 Answers

3 votes

This is a reason not to place your sentinels on your Redis instance nodes. Think of them as monitoring agents: you wouldn't place your website monitor on the same node running your website and expect to catch the node's death. The same applies to Sentinel.

The proper route is ideally to run the sentinels from the clients, and if that isn't possible or workable, then from dedicated nodes as close to the clients as possible.

As antirez said, you need enough sentinels to hold an election. There are two elections: first, deciding on the new master, and second, deciding which sentinel handles the promotion. In your scenario you have only one sentinel left, but to elect a sentinel to handle the promotion, that sentinel needs votes from a majority of all known sentinels. In your case, with three sentinels total, that means two must vote before an election can take place; a lone surviving sentinel can never win. This majority number is not configurable and is unaffected by the quorum setting. It is in place to reduce the chances of multiple masters.
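A quick way to check whether enough sentinels are currently reachable for a failover to be authorized is the ckquorum subcommand, assuming your Sentinel version is recent enough to have it:

    redis-cli -p 26379 SENTINEL ckquorum mymaster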

I would also strongly advise against setting the quorum to less than half+1 of your sentinels. This can lead to split-brain operation, where you have two masters (or, in your case, possibly three). If you lost connectivity between your master and the two slaves, but clients still had connectivity, your settings could trigger a split brain: a slave would be promoted, new connections would talk to that master, and existing connections would keep talking to the original. You would then have valid data in two masters that likely conflict with each other.
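For example, with three sentinels, half+1 is two, so the monitor line would look roughly like this (the master's address is a placeholder):

    sentinel monitor mymaster x.x.x.x 6379 2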

The author of that Symantec article only considers the Redis daemon dying, not the node. Thus it really isn't an HA setup.

2 votes

The quorum is only used to reach the ODOWN state, which triggers the failover. For the failover to actually happen, the slave must be voted by a majority of Sentinels, so a single node can't get elected.

If you have such a requirement, and you don't care that only the majority side is able to continue operating (this means unbounded data loss on the minority side if clients get partitioned with a minority where there is a master), you can just add sentinels on your client machines as well. This way the total number of Sentinels is, for example, five, and even if two Redis nodes are down, the one remaining node plus the two sentinels running client-side are enough to reach the majority of three.

Unfortunately the Sentinel documentation is not complete enough to explain this. All the information needed to get the picture right is there, but there are no examples for faster reading / deploying.
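A minimal sketch of what the configuration for one of those client-side sentinels might look like, assuming five sentinels total and a quorum of three (the master's address is a placeholder):

    port 26379
    sentinel monitor mymaster x.x.x.x 6379 3
    sentinel down-after-milliseconds mymaster 5000
    sentinel failover-timeout mymaster 10000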