I have a two-node cluster (CentOS 7-based), intended to be active/passive, with DRBD resources, application resources dependent on them, and a cluster IP dependent on the apps through ordering constraints. I have no colocation constraints; instead, all my resources are in the same group so that they migrate together.
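For context, the group and the ordering constraints were created roughly along these lines (a reconstruction from the pcs config dump further down, not the exact commands that were run):

pcs resource group add mygroup LV_DATAFS LV_POSTGRESFS postgresql_9.6 LV_HOMEFS myapp ClusterIP
pcs constraint order promote LV_DATAClone then start LV_DATAFS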
There are two network interfaces on each node: one on the LAN and the other a private point-to-point link. DRBD is configured to use the point-to-point link. Both networks are configured into RRP with the RRP mode set to passive, the LAN being the primary Pacemaker/Corosync ring and the point-to-point link serving as backup.
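The relevant parts of corosync.conf look roughly like this; the ring1 addresses are placeholders for the real point-to-point IPs:

totem {
    version: 2
    cluster_name: MY_HA
    transport: udpu
    # LAN is ring 0, the private point-to-point link is ring 1
    rrp_mode: passive
}

nodelist {
    node {
        ring0_addr: za-mycluster1.sMY.co.za
        # placeholder for the point-to-point address
        ring1_addr: 10.0.0.1
    }
    node {
        ring0_addr: za-mycluster2.sMY.co.za
        # placeholder for the point-to-point address
        ring1_addr: 10.0.0.2
    }
}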
Failover by rebooting or powering down the active node works fine and all resources successfully migrate to the survivor. This is where the good news stops though.
I have a ping resource pinging a host reachable via the LAN interface, with a location constraint based on the ping score that should move the resource group to the passive node if the active node loses connectivity to the ping host. This part, however, does not work correctly.
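The ping resource and the location rule were created more or less as follows (reconstructed; the resulting definitions are in the pcs config dump further down):

pcs resource create pingd ocf:pacemaker:ping host_list=192.168.51.1 multiplier=1000 dampen=5s --clone
pcs constraint location mygroup rule score=-INFINITY pingd lt 1 or not_defined pingd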
When I pull the LAN cable on the active node, that node can no longer ping the ping host and the resources get stopped on it, as expected. Bear in mind that the Corosync instances can still communicate with one another, as they fall back onto the private network thanks to RRP. The resources, however, cannot be started on the previously passive node (the one that can still reach the gateway and that should now become active), because the DRBD resources remain primary on the node whose cable was pulled, so the filesystems cannot be mounted on the node that should take over. Keep in mind that DRBD stays connected over the private network throughout, since that cable was not pulled.
I can't figure out why the ping-based location constraint does not migrate the resource group completely, all the way down to the DRBD primary/secondary roles. I was hoping someone here could assist. Below is the state after I pulled the cable and the cluster migrated as far as it could before getting stuck.
[root@za-mycluster1 ~]# pcs status
Cluster name: MY_HA
Stack: corosync
Current DC: za-mycluster1.sMY.co.za (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Fri Apr 24 19:12:57 2020
Last change: Fri Apr 24 16:39:45 2020 by hacluster via crmd on za-mycluster1.sMY.co.za
2 nodes configured
14 resources configured
Online: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]
Full list of resources:
 Master/Slave Set: LV_DATAClone [LV_DATA]
     Masters: [ za-mycluster1.sMY.co.za ]
     Slaves: [ za-mycluster2.sMY.co.za ]
 Resource Group: mygroup
     LV_DATAFS (ocf::heartbeat:Filesystem): Stopped
     LV_POSTGRESFS (ocf::heartbeat:Filesystem): Stopped
     postgresql_9.6 (systemd:postgresql-9.6): Stopped
     LV_HOMEFS (ocf::heartbeat:Filesystem): Stopped
     myapp (lsb:myapp): Stopped
     ClusterIP (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: LV_POSTGRESClone [LV_POSTGRES]
     Masters: [ za-mycluster1.sMY.co.za ]
     Slaves: [ za-mycluster2.sMY.co.za ]
 Master/Slave Set: LV_HOMEClone [LV_HOME]
     Masters: [ za-mycluster1.sMY.co.za ]
     Slaves: [ za-mycluster2.sMY.co.za ]
 Clone Set: pingd-clone [pingd]
     Started: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]

Failed Resource Actions:
* LV_DATAFS_start_0 on za-mycluster2.sMY.co.za 'unknown error' (1): call=57, status=complete, exitreason='Couldn't mount device [/dev/drbd0] as /data',
    last-rc-change='Fri Apr 24 16:59:10 2020', queued=0ms, exec=75ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Note the error mounting the DRBD filesystem on the migration target. Looking at the DRBD status at this point shows that node 1 is still primary, so the DRBD resources were never demoted to secondary when the other resources were stopped.
[root@za-mycluster1 ~]# cat /proc/drbd
version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2018-11-03 01:26:55
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:169816 nr:0 dw:169944 dr:257781 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:6108 nr:0 dw:10324 dr:17553 al:14 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:3368 nr:0 dw:4380 dr:72609 al:6 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
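The per-resource view with drbdadm tells the same story on DRBD 8.4 (resource names taken from the drbd_resource attributes in the config below); each of these still reports Primary/Secondary on node 1:

drbdadm role lv_DATA
drbdadm role lv_postgres
drbdadm role lv_home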
This is what the configuration looks like:
[root@za-mycluster1 ~]# pcs config
Cluster Name: MY_HA
Corosync Nodes:
za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Pacemaker Nodes:
za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Resources:
 Master: LV_DATAClone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: LV_DATA (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=lv_DATA
   Operations: demote interval=0s timeout=90 (LV_DATA-demote-interval-0s)
               monitor interval=60s (LV_DATA-monitor-interval-60s)
               notify interval=0s timeout=90 (LV_DATA-notify-interval-0s)
               promote interval=0s timeout=90 (LV_DATA-promote-interval-0s)
               reload interval=0s timeout=30 (LV_DATA-reload-interval-0s)
               start interval=0s timeout=240 (LV_DATA-start-interval-0s)
               stop interval=0s timeout=100 (LV_DATA-stop-interval-0s)
 Group: mygroup
  Resource: LV_DATAFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd0 directory=/data fstype=ext4
   Operations: monitor interval=20s timeout=40s (LV_DATAFS-monitor-interval-20s)
               notify interval=0s timeout=60s (LV_DATAFS-notify-interval-0s)
               start interval=0s timeout=60s (LV_DATAFS-start-interval-0s)
               stop interval=0s timeout=60s (LV_DATAFS-stop-interval-0s)
  Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/var/lib/pgsql fstype=ext4
   Operations: monitor interval=20s timeout=40s (LV_POSTGRESFS-monitor-interval-20s)
               notify interval=0s timeout=60s (LV_POSTGRESFS-notify-interval-0s)
               start interval=0s timeout=60s (LV_POSTGRESFS-start-interval-0s)
               stop interval=0s timeout=60s (LV_POSTGRESFS-stop-interval-0s)
  Resource: postgresql_9.6 (class=systemd type=postgresql-9.6)
   Operations: monitor interval=60s (postgresql_9.6-monitor-interval-60s)
               start interval=0s timeout=100 (postgresql_9.6-start-interval-0s)
               stop interval=0s timeout=100 (postgresql_9.6-stop-interval-0s)
  Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd2 directory=/home fstype=ext4
   Operations: monitor interval=20s timeout=40s (LV_HOMEFS-monitor-interval-20s)
               notify interval=0s timeout=60s (LV_HOMEFS-notify-interval-0s)
               start interval=0s timeout=60s (LV_HOMEFS-start-interval-0s)
               stop interval=0s timeout=60s (LV_HOMEFS-stop-interval-0s)
  Resource: myapp (class=lsb type=myapp)
   Operations: force-reload interval=0s timeout=15 (myapp-force-reload-interval-0s)
               monitor interval=60s on-fail=standby timeout=10s (myapp-monitor-interval-60s)
               restart interval=0s timeout=120s (myapp-restart-interval-0s)
               start interval=0s timeout=60s (myapp-start-interval-0s)
               stop interval=0s timeout=60s (myapp-stop-interval-0s)
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=32 ip=192.168.51.185
   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
 Master: LV_POSTGRESClone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=lv_postgres
   Operations: demote interval=0s timeout=90 (LV_POSTGRES-demote-interval-0s)
               monitor interval=60s (LV_POSTGRES-monitor-interval-60s)
               notify interval=0s timeout=90 (LV_POSTGRES-notify-interval-0s)
               promote interval=0s timeout=90 (LV_POSTGRES-promote-interval-0s)
               reload interval=0s timeout=30 (LV_POSTGRES-reload-interval-0s)
               start interval=0s timeout=240 (LV_POSTGRES-start-interval-0s)
               stop interval=0s timeout=100 (LV_POSTGRES-stop-interval-0s)
 Master: LV_HOMEClone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: LV_HOME (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=lv_home
   Operations: demote interval=0s timeout=90 (LV_HOME-demote-interval-0s)
               monitor interval=60s (LV_HOME-monitor-interval-60s)
               notify interval=0s timeout=90 (LV_HOME-notify-interval-0s)
               promote interval=0s timeout=90 (LV_HOME-promote-interval-0s)
               reload interval=0s timeout=30 (LV_HOME-reload-interval-0s)
               start interval=0s timeout=240 (LV_HOME-start-interval-0s)
               stop interval=0s timeout=100 (LV_HOME-stop-interval-0s)
 Clone: pingd-clone
  Resource: pingd (class=ocf provider=pacemaker type=ping)
   Attributes: dampen=5s host_list=192.168.51.1 multiplier=1000
   Operations: monitor interval=10 timeout=60 (pingd-monitor-interval-10)
               start interval=0s timeout=60 (pingd-start-interval-0s)
               stop interval=0s timeout=20 (pingd-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
  Resource: mygroup
    Constraint: location-mygroup
      Rule: boolean-op=or score=-INFINITY (id:location-mygroup-rule)
        Expression: pingd lt 1 (id:location-mygroup-rule-expr)
        Expression: not_defined pingd (id:location-mygroup-rule-expr-1)
Ordering Constraints:
promote LV_DATAClone then start LV_DATAFS (kind:Mandatory) (id:order-LV_DATAClone-LV_DATAFS-mandatory)
promote LV_POSTGRESClone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRESClone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql_9.6 (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql_9.6-mandatory)
promote LV_HOMEClone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOMEClone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start myapp (kind:Mandatory) (id:order-LV_HOMEFS-myapp-mandatory)
start myapp then start ClusterIP (kind:Mandatory) (id:order-myapp-ClusterIP-mandatory)
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
resource-stickiness=INFINITY
Operations Defaults:
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: MY_HA
dc-version: 1.1.20-5.el7-3c4c782f70
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
Quorum:
Options:
Any insight would be welcome.