This Redis Cluster has 240 nodes (120 masters and 120 slaves) and worked well for a long time. But now a master-slave switch happens roughly every few hours.
I collected some logs from the Redis servers.
5c541d3a765e087af7775ba308f51ffb2aa54151 10.12.28.165:6502
13306:M 08 Mar 18:55:02.597 * Background append only file rewriting started by pid 15396
13306:M 08 Mar 18:55:41.636 # Cluster state changed: fail
13306:M 08 Mar 18:55:45.321 # Connection with slave client id #112948 lost.
13306:M 08 Mar 18:55:46.243 # Configuration change detected. Reconfiguring myself as a replica of afb6e012db58bd26a7c96182b04f0a2ba6a45768
13306:S 08 Mar 18:55:47.134 * AOF rewrite child asks to stop sending diffs.
15396:C 08 Mar 18:55:47.134 * Parent agreed to stop sending diffs. Finalizing AOF...
15396:C 08 Mar 18:55:47.134 * Concatenating 0.02 MB of AOF diff received from parent.
15396:C 08 Mar 18:55:47.135 * SYNC append only file rewrite performed
15396:C 08 Mar 18:55:47.186 * AOF rewrite: 4067 MB of memory used by copy-on-write
13306:S 08 Mar 18:55:47.209 # Cluster state changed: ok

5ac747878f881349aa6a62b179176ddf603e034c 10.12.30.107:6500
22825:M 08 Mar 18:55:30.534 * FAIL message received from da493af5bb3d15fc563961de09567a47787881be about 5c541d3a765e087af7775ba308f51ffb2aa54151
22825:M 08 Mar 18:55:31.440 # Failover auth granted to afb6e012db58bd26a7c96182b04f0a2ba6a45768 for epoch 323
22825:M 08 Mar 18:55:41.587 * Background append only file rewriting started by pid 23628
22825:M 08 Mar 18:56:24.200 # Cluster state changed: fail
22825:M 08 Mar 18:56:30.002 # Connection with slave client id #382416 lost.
22825:M 08 Mar 18:56:30.830 * FAIL message received from 0decbe940c6f4d4330fae5a9c129f1ad4932405d about 5ac747878f881349aa6a62b179176ddf603e034c
22825:M 08 Mar 18:56:30.840 # Failover auth denied to d46f95da06cfcd8ea5eaa15efabff5bd5e99df55: its master is up
22825:M 08 Mar 18:56:30.843 # Configuration change detected. Reconfiguring myself as a replica of d46f95da06cfcd8ea5eaa15efabff5bd5e99df55
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5ac747878f881349aa6a62b179176ddf603e034c: slave is reachable again.
22825:S 08 Mar 18:56:31.030 * Clear FAIL state for node 5c541d3a765e087af7775ba308f51ffb2aa54151: slave is reachable again.
22825:S 08 Mar 18:56:31.294 # Cluster state changed: ok
22825:S 08 Mar 18:56:31.595 * Connecting to MASTER 10.12.30.104:6404
22825:S 08 Mar 18:56:31.671 * MASTER SLAVE sync started
22825:S 08 Mar 18:56:31.671 * Non blocking connect for SYNC fired the event.
22825:S 08 Mar 18:56:31.672 * Master replied to PING, replication can continue...
22825:S 08 Mar 18:56:31.673 * Partial resynchronization not possible (no cached master)
22825:S 08 Mar 18:56:31.691 * AOF rewrite child asks to stop sending diffs.
It appears that the Redis master-slave switch happens after an AOF rewrite starts.
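To put a number on that, here is a minimal Python sketch that measures the time between the "rewriting started" line and the next "Cluster state changed: fail" line. The two sample lines are copied from the log of node 5c541d3a765e087af7775ba308f51ffb2aa54151; the helper names `parse_time` and `seconds_between` are my own, and the parsing assumes the default Redis log layout (`pid:role DD Mon HH:MM:SS.mmm ...`) and an English locale for the month name.

```python
from datetime import datetime

# Two lines copied from the log of node 5c541d3a... (10.12.28.165:6502).
LOG_LINES = [
    "13306:M 08 Mar 18:55:02.597 * Background append only file rewriting started by pid 15396",
    "13306:M 08 Mar 18:55:41.636 # Cluster state changed: fail",
]


def parse_time(line):
    """Return the timestamp of one Redis log line as a datetime."""
    fields = line.split()
    # fields[1:4] looks like ["08", "Mar", "18:55:02.597"]; the log carries no year,
    # so strptime falls back to its default year, which is fine for differences.
    return datetime.strptime(" ".join(fields[1:4]), "%d %b %H:%M:%S.%f")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the next line containing end_marker."""
    start = None
    for line in lines:
        if start is None and start_marker in line:
            start = parse_time(line)
        elif start is not None and end_marker in line:
            return (parse_time(line) - start).total_seconds()
    return None  # one of the markers was not found


gap = seconds_between(
    LOG_LINES,
    "append only file rewriting started",
    "Cluster state changed: fail",
)
print(f"{gap:.3f} s")  # prints "39.039 s"
```

On the second node (5ac747878f881349aa6a62b179176ddf603e034c) the same two events are 42.613 s apart, so in both cases the cluster state changes to fail about 40 seconds after the rewrite starts.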
Here is the config of this cluster.
daemonize no
tcp-backlog 511
timeout 0
tcp-keepalive 60
loglevel notice
databases 16
dir "/var/cachecloud/data"
stop-writes-on-bgsave-error no
repl-timeout 60
repl-ping-slave-period 10
repl-disable-tcp-nodelay no
repl-backlog-size 10000000
repl-backlog-ttl 7200
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 512mb 128mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
port 6401
maxmemory 13000mb
maxmemory-policy volatile-lru
appendonly yes
appendfsync no
appendfilename "appendonly-6401.aof"
dbfilename "dump-6401.rdb"
aof-rewrite-incremental-fsync yes
no-appendfsync-on-rewrite yes
auto-aof-rewrite-min-size 62500kb
auto-aof-rewrite-percentage 86
rdbcompression yes
rdbchecksum yes
repl-diskless-sync no
repl-diskless-sync-delay 5
maxclients 10000
hll-sparse-max-bytes 3000
min-slaves-to-write 0
min-slaves-max-lag 10
aof-load-truncated yes
notify-keyspace-events ""
bind 10.12.26.226
protected-mode no
cluster-enabled yes
cluster-node-timeout 15000
cluster-slave-validity-factor 10
cluster-migration-barrier 1
cluster-config-file "nodes-6401.conf"
cluster-require-full-coverage no
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
In my opinion, an AOF rewrite should not affect the Redis main thread. But it seems to make this node stop responding to other nodes' pings.
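One part of the rewrite that does run in the main thread is the fork() call that creates the rewrite child, and Redis reports its duration as `latest_fork_usec` in INFO. I have not confirmed that this is the cause here; the sketch below only shows how that value could be compared with `cluster-node-timeout` (15000 ms in the config above). The function name `fork_vs_node_timeout` is made up for this post, and the `info` argument is assumed to be a dict like the one redis-py's `Redis.info()` returns.

```python
def fork_vs_node_timeout(info, node_timeout_ms=15000):
    """Compare the duration of the last fork with cluster-node-timeout.

    info: dict of INFO fields (assumed shape: what redis-py's Redis.info() returns).
    node_timeout_ms: cluster-node-timeout in milliseconds (15000 in this cluster's config).
    """
    # latest_fork_usec is reported in microseconds; convert to milliseconds.
    fork_ms = info["latest_fork_usec"] / 1000.0
    return {
        "fork_ms": fork_ms,
        "node_timeout_ms": node_timeout_ms,
        "fork_exceeds_timeout": fork_ms >= node_timeout_ms,
        # aof_rewrite_in_progress is 1 while a rewrite child is running.
        "rewrite_in_progress": bool(info.get("aof_rewrite_in_progress", 0)),
    }


# Possible usage against a live node (needs the redis package and network access):
#   import redis
#   r = redis.Redis(host="10.12.28.165", port=6502)
#   print(fork_vs_node_timeout(r.info()))
```

If the fork time turns out to be far below the node timeout, the stall has to come from somewhere else, so this check can only rule the fork in or out.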