I have a two-node cluster (CentOS 7-based), intended to be active/passive, with DRBD resources, application resources dependent on them, and a cluster IP dependent on the apps through ordering constraints. I have no colocation constraints; instead, all my resources are in the same group so they migrate together.

There are two network interfaces on each node: one is on the LAN and the other is a private point-to-point connection. DRBD is configured to use the point-to-point link. Both networks are configured into RRP, with the LAN as the primary Pacemaker/Corosync ring and the point-to-point link serving as backup by setting the RRP mode to passive.
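
For reference, the RRP part of this setup looks roughly like the sketch below in corosync.conf; the ring 1 addresses and the transport line are placeholders/assumptions rather than my actual values:

totem {
    version: 2
    cluster_name: MY_HA
    transport: udpu            # assumed; the usual unicast transport on CentOS 7
    rrp_mode: passive          # ring 0 (LAN) is primary, ring 1 (point-to-point) is backup
}

nodelist {
    node {
        ring0_addr: za-mycluster1.sMY.co.za
        ring1_addr: 10.0.0.1   # hypothetical point-to-point address
        nodeid: 1
    }
    node {
        ring0_addr: za-mycluster2.sMY.co.za
        ring1_addr: 10.0.0.2   # hypothetical point-to-point address
        nodeid: 2
    }
}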

Failover by rebooting or powering down the active node works fine, and all resources successfully migrate to the survivor. That is where the good news stops, though.

I have a ping resource pinging a host reachable on the LAN interface, with a location constraint based on the ping attribute to move the resource group to the passive node should the active node lose connectivity to the ping host. This part, however, does not work correctly.
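
For context, the ping clone and the location rule (shown in full in the pcs config further down) were created with commands along these lines; this is a sketch from memory rather than a verbatim history:

# pcs resource create pingd ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=192.168.51.1 --clone
# pcs constraint location mygroup rule score=-INFINITY pingd lt 1 or not_defined pingd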

When I pull the LAN network cable on the active node, that node can no longer ping the ping host and the resources get stopped on it, as expected. Bear in mind that the Corosync instances can still communicate with one another, because they fall back onto the private network thanks to RRP. The resources, however, cannot be started on the previously passive node (the one that can still reach the gateway and that should become active now), because the DRBD resources remain primary on the node whose cable was pulled, so the file systems cannot be mounted on the node that should take over. Keep in mind that DRBD stays connected over the private network during this time, as that cable was not pulled.

I can't figure out why the ping-based location constraint does not migrate the resource group correctly all the way down to the DRBD primary/secondary level. I was hoping someone here could assist. Below is the state after I pulled the cable and the cluster went as far as it could with the migration before getting stuck.

[root@za-mycluster1 ~]# pcs status
Cluster name: MY_HA
Stack: corosync
Current DC: za-mycluster1.sMY.co.za (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Fri Apr 24 19:12:57 2020
Last change: Fri Apr 24 16:39:45 2020 by hacluster via crmd on za-mycluster1.sMY.co.za

2 nodes configured
14 resources configured

Online: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]

Full list of resources:

 Master/Slave Set: LV_DATAClone [LV_DATA]
     Masters: [ za-mycluster1.sMY.co.za ]
     Slaves: [ za-mycluster2.sMY.co.za ]
 Resource Group: mygroup
     LV_DATAFS  (ocf::heartbeat:Filesystem):    Stopped
     LV_POSTGRESFS      (ocf::heartbeat:Filesystem):    Stopped
     postgresql_9.6     (systemd:postgresql-9.6):       Stopped
     LV_HOMEFS  (ocf::heartbeat:Filesystem):    Stopped
     myapp (lsb:myapp):  Stopped
     ClusterIP  (ocf::heartbeat:IPaddr2):       Stopped
 Master/Slave Set: LV_POSTGRESClone [LV_POSTGRES]
     Masters: [ za-mycluster1.sMY.co.za ]
     Slaves: [ za-mycluster2.sMY.co.za ]
 Master/Slave Set: LV_HOMEClone [LV_HOME]
     Masters: [ za-mycluster1.sMY.co.za ]
     Slaves: [ za-mycluster2.sMY.co.za ]
 Clone Set: pingd-clone [pingd]
     Started: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]

Failed Resource Actions:
* LV_DATAFS_start_0 on za-mycluster2.sMY.co.za 'unknown error' (1): call=57, status=complete, exitreason='Couldn't mount device [/dev/drbd0] as /data',
    last-rc-change='Fri Apr 24 16:59:10 2020', queued=0ms, exec=75ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Note the error mounting the DRBD-backed filesystem on the migration target. Looking at the DRBD status at this point shows node 1 is still primary, so the DRBD resources never got demoted to secondary when the other resources were stopped.

[root@za-mycluster1 ~]# cat /proc/drbd
version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2018-11-03 01:26:55
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:169816 nr:0 dw:169944 dr:257781 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:6108 nr:0 dw:10324 dr:17553 al:14 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:3368 nr:0 dw:4380 dr:72609 al:6 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

This is what the configuration looks like:

[root@za-mycluster1 ~]# pcs config
Cluster Name: MY_HA
Corosync Nodes:
 za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Pacemaker Nodes:
 za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za

Resources:
 Master: LV_DATAClone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: LV_DATA (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=lv_DATA
   Operations: demote interval=0s timeout=90 (LV_DATA-demote-interval-0s)
               monitor interval=60s (LV_DATA-monitor-interval-60s)
               notify interval=0s timeout=90 (LV_DATA-notify-interval-0s)
               promote interval=0s timeout=90 (LV_DATA-promote-interval-0s)
               reload interval=0s timeout=30 (LV_DATA-reload-interval-0s)
               start interval=0s timeout=240 (LV_DATA-start-interval-0s)
               stop interval=0s timeout=100 (LV_DATA-stop-interval-0s)
 Group: mygroup
  Resource: LV_DATAFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd0 directory=/data fstype=ext4
   Operations: monitor interval=20s timeout=40s (LV_DATAFS-monitor-interval-20s)
               notify interval=0s timeout=60s (LV_DATAFS-notify-interval-0s)
               start interval=0s timeout=60s (LV_DATAFS-start-interval-0s)
               stop interval=0s timeout=60s (LV_DATAFS-stop-interval-0s)
  Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/var/lib/pgsql fstype=ext4
   Operations: monitor interval=20s timeout=40s (LV_POSTGRESFS-monitor-interval-20s)
               notify interval=0s timeout=60s (LV_POSTGRESFS-notify-interval-0s)
               start interval=0s timeout=60s (LV_POSTGRESFS-start-interval-0s)
               stop interval=0s timeout=60s (LV_POSTGRESFS-stop-interval-0s)
  Resource: postgresql_9.6 (class=systemd type=postgresql-9.6)
   Operations: monitor interval=60s (postgresql_9.6-monitor-interval-60s)
               start interval=0s timeout=100 (postgresql_9.6-start-interval-0s)
               stop interval=0s timeout=100 (postgresql_9.6-stop-interval-0s)
  Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd2 directory=/home fstype=ext4
   Operations: monitor interval=20s timeout=40s (LV_HOMEFS-monitor-interval-20s)
               notify interval=0s timeout=60s (LV_HOMEFS-notify-interval-0s)
               start interval=0s timeout=60s (LV_HOMEFS-start-interval-0s)
               stop interval=0s timeout=60s (LV_HOMEFS-stop-interval-0s)
  Resource: myapp (class=lsb type=myapp)
   Operations: force-reload interval=0s timeout=15 (myapp-force-reload-interval-0s)
               monitor interval=60s on-fail=standby timeout=10s (myapp-monitor-interval-60s)
               restart interval=0s timeout=120s (myapp-restart-interval-0s)
               start interval=0s timeout=60s (myapp-start-interval-0s)
               stop interval=0s timeout=60s (myapp-stop-interval-0s)
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: cidr_netmask=32 ip=192.168.51.185
   Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
               start interval=0s timeout=20s (ClusterIP-start-interval-0s)
               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
 Master: LV_POSTGRESClone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=lv_postgres
   Operations: demote interval=0s timeout=90 (LV_POSTGRES-demote-interval-0s)
               monitor interval=60s (LV_POSTGRES-monitor-interval-60s)
               notify interval=0s timeout=90 (LV_POSTGRES-notify-interval-0s)
               promote interval=0s timeout=90 (LV_POSTGRES-promote-interval-0s)
               reload interval=0s timeout=30 (LV_POSTGRES-reload-interval-0s)
               start interval=0s timeout=240 (LV_POSTGRES-start-interval-0s)
               stop interval=0s timeout=100 (LV_POSTGRES-stop-interval-0s)
 Master: LV_HOMEClone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
  Resource: LV_HOME (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=lv_home
   Operations: demote interval=0s timeout=90 (LV_HOME-demote-interval-0s)
               monitor interval=60s (LV_HOME-monitor-interval-60s)
               notify interval=0s timeout=90 (LV_HOME-notify-interval-0s)
               promote interval=0s timeout=90 (LV_HOME-promote-interval-0s)
               reload interval=0s timeout=30 (LV_HOME-reload-interval-0s)
               start interval=0s timeout=240 (LV_HOME-start-interval-0s)
               stop interval=0s timeout=100 (LV_HOME-stop-interval-0s)
 Clone: pingd-clone
  Resource: pingd (class=ocf provider=pacemaker type=ping)
   Attributes: dampen=5s host_list=192.168.51.1 multiplier=1000
   Operations: monitor interval=10 timeout=60 (pingd-monitor-interval-10)
               start interval=0s timeout=60 (pingd-start-interval-0s)
               stop interval=0s timeout=20 (pingd-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: mygroup
    Constraint: location-mygroup
      Rule: boolean-op=or score=-INFINITY  (id:location-mygroup-rule)
        Expression: pingd lt 1  (id:location-mygroup-rule-expr)
        Expression: not_defined pingd  (id:location-mygroup-rule-expr-1)
Ordering Constraints:
  promote LV_DATAClone then start LV_DATAFS (kind:Mandatory) (id:order-LV_DATAClone-LV_DATAFS-mandatory)
  promote LV_POSTGRESClone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRESClone-LV_POSTGRESFS-mandatory)
  start LV_POSTGRESFS then start postgresql_9.6 (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql_9.6-mandatory)
  promote LV_HOMEClone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOMEClone-LV_HOMEFS-mandatory)
  start LV_HOMEFS then start myapp (kind:Mandatory) (id:order-LV_HOMEFS-myapp-mandatory)
  start myapp then start ClusterIP (kind:Mandatory) (id:order-myapp-ClusterIP-mandatory)
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness=INFINITY
Operations Defaults:
 timeout=240s

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: MY_HA
 dc-version: 1.1.20-5.el7-3c4c782f70
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:

Any insight will be welcomed.

1 Answer


Assuming the Filesystem resources in your group live on the DRBD devices defined outside of the group, you will need at least one order constraint and one colocation constraint per DRBD device, telling the cluster that it may only start mygroup after the DRBD devices are promoted to primary, and only on the node where they are primary. Your ping resource is working, since you're seeing mygroup stop and attempt to start on the peer; it fails to start there because nothing tells the DRBD primary role to move with the group, and that's where the filesystems live.

Try adding the following constraints to the cluster:

# pcs cluster cib drbd_constraints

# pcs -f drbd_constraints constraint colocation add mygroup with master LV_DATAClone INFINITY
# pcs -f drbd_constraints constraint order promote LV_DATAClone then start mygroup

# pcs -f drbd_constraints constraint colocation add mygroup with master LV_POSTGRESClone INFINITY
# pcs -f drbd_constraints constraint order promote LV_POSTGRESClone then start mygroup

# pcs -f drbd_constraints constraint colocation add mygroup with master LV_HOMEClone INFINITY
# pcs -f drbd_constraints constraint order promote LV_HOMEClone then start mygroup

# pcs cluster cib-push drbd_constraints
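
After pushing the CIB, it is worth verifying the constraints and clearing the earlier failed start before re-testing the cable pull, for example:

# pcs constraint show                  # confirm the new colocation and order constraints are present
# pcs resource cleanup LV_DATAFS       # clear the failed start action recorded on za-mycluster2
# pcs status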