I am trying to cluster 2 computers together with Pacemaker/Corosync. The only resource that they share is an ocf:heartbeat:IPaddr this is the main problem:
Since there are only two nodes failover will only occur if the no-quorum-policy=ignore
.
When the network cable is pulled from node A, corosync on node A binds to 127.0.0.1 and pacemaker believes that node A is still online and the node B is the one offline.
Pacemaker attempts to start the IPaddr on Node A but it fails to start because there is no network connection. Node B on the other hand recognizes that node B is offline and if the IPaddr service was started on node A it will start it on itself (node B) successfully.
However, since the service failed to start on node A it enters a fatal state and has to be rebooted to rejoin the cluster. (you could restart some of the needed services instead.)
1 workaround is the set start-failure-is-fatal="false"
which makes node A continue to try to start the IPaddr service until it is successful. the problem with this is that once it is successful you have a ip conflict between the two nodes until they re cluster and one of the gives up the resource.
I am playing around with the idea of having a node attribute that mirrors cat /sys/class/net/eth0/carrier
which is 1 when the cable is connected and zero when it is disconnected and then having a location rule that says if "connected" == zero don't start service kind of thing, but we'll see.
Any thoughts or ideas would be greatly appreciated.