I'm setting up HAProxy as a load balancer and failover layer in front of a master/master + slave replication setup. I have two xinetd bash scripts listening on ports 9200 and 9201. The one on port 9200 checks the master status and the one on port 9201 checks the slave status and how far behind the master it is.
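The check scripts are just small xinetd services that print enough of an HTTP response for HAProxy's httpchk to see a 200 or a 503. Roughly, the 9201 slave check looks something like this (the check user, password and lag threshold below are simplified placeholders, not my real values):

#!/bin/bash
# Sketch of the 9201 slave check, run by xinetd; stdout goes back to HAProxy.
MYSQL="mysql -u haproxy_check -pSECRET -h 127.0.0.1"
MAX_LAG=60   # placeholder threshold in seconds

STATUS=$($MYSQL -e 'SHOW SLAVE STATUS\G' 2>/dev/null)
IO=$(echo "$STATUS"  | awk '/Slave_IO_Running:/  {print $2}')
SQL=$(echo "$STATUS" | awk '/Slave_SQL_Running:/ {print $2}')
LAG=$(echo "$STATUS" | awk '/Seconds_Behind_Master:/ {print $2}')

if [ "$IO" = "Yes" ] && [ "$SQL" = "Yes" ] && [ "$LAG" != "NULL" ] && [ "$LAG" -le "$MAX_LAG" ]; then
    printf 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nSlave OK (lag %ss)\r\n' "$LAG"
else
    printf 'HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\n\r\nSlave not available\r\n'
fi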
My HAProxy config file looks like this:
global
    log 127.0.0.1 local0 notice

defaults
    log global
    retries 2
    timeout connect 10000
    timeout server 28800000
    timeout client 28800000
# writes and critical reads go here
# (critical reads are the ones where we can't afford any latency at all)
listen mariadb-writes
    bind 0.0.0.0:3307
    mode tcp
    option allbackups
    option httpchk
    balance roundrobin
    # 9200 checks the master status
    server mariadb1 1.1.1.1:3306 check port 9200          # master1
    server mariadb2 2.2.2.2:3306 check port 9200 backup   # master2

# heavy reads where we can afford some latency
listen mariadb-reads
    bind 0.0.0.0:3308
    mode tcp
    option allbackups
    option httpchk
    balance roundrobin
    # 9201 checks the slave status and seconds behind
    server mariadb1 1.1.1.1:3306 check port 9201
    server mariadb2 2.2.2.2:3306 check port 9201
    server mariadb3 3.3.3.3:3306 check port 9201
    # 9200 on the backups checks the master status
    server mariadb1b 1.1.1.1:3306 check port 9200 backup
    server mariadb2b 2.2.2.2:3306 check port 9200 backup
The reason I use two scripts is that it's the only way I found to solve a broken-replication problem, but it also creates a new issue. I opted for two different scripts because checking the slave status on my master-master replication could deactivate one of the masters when the other one goes down, since that breaks the replication. So instead of checking the slave status on my masters, I just write to one of the nodes and keep writing to it as long as it's up. If for some reason my master goes down, the backup master will take the requests.
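For reference, the current 9200 master check is essentially just an "is this node alive" probe, something like this (again with placeholder credentials):

#!/bin/bash
# Sketch of the current 9200 master check: it only cares whether the node
# itself answers, not whether its replication is healthy.
MYSQL="mysql -u haproxy_check -pSECRET -h 127.0.0.1"

if $MYSQL -e 'SELECT 1' >/dev/null 2>&1; then
    printf 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nMaster OK\r\n'
else
    printf 'HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\n\r\nMaster down\r\n'
fi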
The problem I see with that is that if master1 goes down, master2 will receive the writes, and depending on how long master1 stays down, its replication will be far behind when it comes back up, so activating it will cause serious data-consistency problems until the replication has caught up.
I'm thinking of doing two checks in the 9200 master script: first check the slave status, and if it's running, check how many seconds it is behind; but if the slave is stopped, check the master status instead. In other words, do not return a 503 just because the slave is broken, since that may only mean the other master went down and broke the replication. But this has some flaws as well: when master1 comes back up, replication will be broken until MariaDB reconnects to master2, so during that time writes can't be directed to that node. I could configure HAProxy to wait several seconds before activating a node that has been down, but that does not seem like the proper solution to me.
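A rough sketch of that combined 9200 check (placeholder credentials and an assumed lag threshold) would be:

#!/bin/bash
# Sketch of the combined 9200 check: prefer the slave view when replication is
# running, but fall back to a plain "is the master up" check when it isn't.
MYSQL="mysql -u haproxy_check -pSECRET -h 127.0.0.1"
MAX_LAG=30   # placeholder threshold in seconds

ok()   { printf 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n%s\r\n' "$1"; exit 0; }
fail() { printf 'HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\n\r\n%s\r\n' "$1"; exit 1; }

STATUS=$($MYSQL -e 'SHOW SLAVE STATUS\G' 2>/dev/null) || fail "mysqld not reachable"

IO=$(echo "$STATUS"  | awk '/Slave_IO_Running:/  {print $2}')
SQL=$(echo "$STATUS" | awk '/Slave_SQL_Running:/ {print $2}')
LAG=$(echo "$STATUS" | awk '/Seconds_Behind_Master:/ {print $2}')

if [ "$IO" = "Yes" ] && [ "$SQL" = "Yes" ]; then
    # Replication is running: also require the node to be reasonably caught up.
    if [ "$LAG" != "NULL" ] && [ "$LAG" -le "$MAX_LAG" ]; then
        ok "Master OK, replication running (lag ${LAG}s)"
    else
        fail "Master behind by ${LAG}s"
    fi
else
    # Slave threads stopped: this may just mean the other master is down,
    # so only check that this master itself is alive instead of returning 503.
    $MYSQL -e 'SELECT 1' >/dev/null 2>&1 && ok "Master OK, replication stopped" || fail "Master down"
fi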
Basically I'm trying to figure out how to manage the connections when master1 comes back up and HAProxy forwards requests to it while it's still catching up, replicating the missed data from master2. Does anyone know a better approach to this issue?