I have a high-availability cluster configured with a DRBD resource.

 Master/Slave Set: RVClone01 [RV_data01]
     Masters: [ rvpcmk01-cr ]
     Slaves: [ rvpcmk02-cr ]

I performed a test in which I disconnect the network adapter that carries the DRBD replication link (for example, by shutting the adapter down). The cluster now reports that everything is OK, BUT the DRBD status shown by "drbd-overview" on the primary server is:

[root@rvpcmk01 ~]# drbd-overview
 0:drbd0/0  WFConnection Primary/Unknown UpToDate/DUnknown /opt ext4 30G 13G 16G 45%

and on the secondary server:

[root@rvpcmk02 ~]# drbd-overview
 0:drbd0/0  StandAlone Secondary/Unknown UpToDate/DUnknown

Now I have a few questions:

1. Why doesn't the cluster know about the problem with DRBD?

2. When I bring the network adapter back UP and the connection between the DRBD peers is restored, why doesn't DRBD recover from this failure and resync on its own?

3. I saw an article about solving a DRBD split-brain - https://www.hastexo.com/resources/hints-and-kinks/solve-drbd-split-brain-4-steps/ - which explains how to recover from a disconnection and resync DRBD. BUT how am I supposed to know that this kind of problem exists in the first place?

I hope I explained my case clearly and provided enough information about what I have and what I need...

1 Answer


1) You aren't using fencing/STONITH devices in Pacemaker or DRBD, which is why nothing happens when you unplug your network interface that DRBD is using. This isn't a scenario that Pacemaker will react to without defining fencing policies within DRBD, and STONITH devices within Pacemaker.

2) You are likely using only one ring for Corosync communications (on the same network as the DRBD device). When that network drops, the Secondary promotes itself to Primary (introducing a split-brain in DRBD) until cluster communications are reconnected, the nodes realize there are two masters, and one is demoted back to Secondary. Again, fencing/STONITH would prevent/handle this.
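As a sketch of the redundant-ring idea (the addresses and node IDs below are hypothetical; adapt them to your networks), a second Corosync ring on a network separate from the DRBD link would look something like this in corosync.conf:

```
totem {
    version: 2
    # passive: ring1 is only used when ring0 fails (redundant ring protocol)
    rrp_mode: passive
}
nodelist {
    node {
        ring0_addr: rvpcmk01-cr      # existing cluster network
        ring1_addr: 192.168.100.1    # hypothetical second network
        nodeid: 1
    }
    node {
        ring0_addr: rvpcmk02-cr
        ring1_addr: 192.168.100.2
        nodeid: 2
    }
}
```

With two rings, losing the DRBD-side network no longer severs cluster communication, so the Secondary will not blindly promote itself.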

3) You can set up the split-brain notification handler in your DRBD configuration.
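Besides the notification handler, you can also poll for the problem yourself. A minimal sketch (the function name and messages are my own; it simply pattern-matches the connection-state column you already see in your drbd-overview output):

```shell
#!/bin/sh
# Sketch of a cron-able check that flags a broken DRBD connection by
# looking at one line of "drbd-overview" output. StandAlone on a node
# is the telltale state after a split-brain; WFConnection means the
# node is still waiting for its peer.
check_drbd_state() {
    # $1: one line of drbd-overview output
    case "$1" in
        *StandAlone*)
            echo "ALERT: DRBD resource is StandAlone (possible split-brain)"
            return 1 ;;
        *WFConnection*)
            echo "WARN: DRBD is waiting for its peer to connect"
            return 1 ;;
        *)
            echo "OK"
            return 0 ;;
    esac
}

# In a real check you would feed it live output, e.g.:
#   drbd-overview | while read -r line; do check_drbd_state "$line"; done
check_drbd_state " 0:drbd0/0  StandAlone Secondary/Unknown UpToDate/DUnknown"
```

Wiring the alert into mail or your monitoring system is then a one-line change.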

Once you have STONITH/fencing devices set up in Pacemaker, you would add the following definitions to your DRBD configuration to "fix" all the issues you mentioned in your question:

resource <resource>
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    ...
  }
  disk {
    fencing resource-and-stonith;
    ...
  }
  ...
}

Setting up fencing/STONITH in Pacemaker is a little too dependent on your hardware/software for me to give you pointers on setting that up for your cluster. This should get you pointed in the right direction: http://clusterlabs.org/doc/crm_fencing.html
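Purely as an illustration of the shape this takes (the fence agent, addresses, and credentials below are hypothetical - pick the agent that matches your hardware's out-of-band management), an IPMI-based STONITH setup with pcs might look like:

```shell
# Hypothetical example: one IPMI fence device per node (fence_ipmilan).
pcs stonith create fence_rvpcmk01 fence_ipmilan \
    pcmk_host_list="rvpcmk01-cr" ipaddr="10.0.0.11" \
    login="admin" passwd="secret" lanplus=1 \
    op monitor interval=60s

pcs stonith create fence_rvpcmk02 fence_ipmilan \
    pcmk_host_list="rvpcmk02-cr" ipaddr="10.0.0.12" \
    login="admin" passwd="secret" lanplus=1 \
    op monitor interval=60s

# Keep each fence device off the node it is meant to fence.
pcs constraint location fence_rvpcmk01 avoids rvpcmk01-cr
pcs constraint location fence_rvpcmk02 avoids rvpcmk02-cr

pcs property set stonith-enabled=true
```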

Hope that helps!