0
votes

Is it possible to configure a Pacemaker resource group so that, on a timeout of any operation (monitor, start; stop may be ignored), the cluster manager migrates the resources to the standby node? If the problem recurs on the standby node, it should move the resources back to the primary node, and so on. It should keep retrying for 5 hours or even indefinitely.

In a real situation where external systems are down, repeatedly restarting is the only way to bring the service back to availability as soon as possible.

Long story here: I'm building resource agents for OCI public and private IPs. In Oracle Cloud, assigning a floating routable IP and an internal one requires interaction with the OCI API to configure the virtual network side. I followed the Dummy example agent code, made a few mistakes and errors, and finally got the code into production. The resource group looks as follows: floating IPs, routes, and a systemd service. I've configured migration-threshold to 5 and resource-stickiness to 100.

 Resource Group: libreswan
 ipsec_cluster_routing_no1  (ocf::heartbeat:Route): Started node1
 ipsec_cluster_public_ip    (ocf::heartbeat:oci_publicip):  Started node1
 ipsec_cluster_private_ip_no1   (ocf::heartbeat:oci_privateip): Started node1
 ipsec_cluster_private_ip_no2   (ocf::heartbeat:oci_privateip): Started node1
 ipsec_cluster_inet_ip_no1  (ocf::heartbeat:IPaddr2):   Started node1
 ipsec_cluster_inet_ip_no2  (ocf::heartbeat:IPaddr2):   Started node1
 ipsec_cluster_routing_no2  (ocf::heartbeat:Route): Started node1
 ipsec_cluster_libreswan    (systemd:ipsec):    Started node1
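For reference, the meta attributes mentioned above can be set per resource with pcs; a minimal sketch, using one resource name from the group listing (repeat for the others as needed):

```shell
# Set the standard Pacemaker meta attributes on one group member:
# fail over after 5 failures, and prefer to stay put once running.
sudo pcs resource meta ipsec_cluster_private_ip_no1 \
    migration-threshold=5 resource-stickiness=100
```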

Recently, due to temporary unavailability of the OCI API, the cluster manager stopped the whole resource group after a 30-second timeout of the monitor() operation on one of the oci_privateip resources.

In the logs, I see the retry sequence (monitor, stop, start) five times. After that, the cluster manager gives up, leaving the resources in the Stopped state. I'd like the cluster manager to keep retrying.
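To inspect how many failures Pacemaker has counted against a resource, and to reset the counter manually so the cluster tries again right away, pcs provides:

```shell
# Show the current fail count for the resource (per node).
sudo pcs resource failcount show ipsec_cluster_private_ip_no1

# Clear the fail count and failed-operation history so the cluster
# attempts to start the resource again immediately.
sudo pcs resource cleanup ipsec_cluster_private_ip_no1
```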


1 Answer

0
votes

SOLVED.

  sudo pcs resource meta $res failure-timeout=120
  sudo pcs resource meta $res migration-threshold=5

makes the "failed" node eligible to take the resources back after 120 seconds. Before giving up, the failed node retries 5 times, so with a 30-second timeout it keeps retrying for about 2.5 minutes.
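One caveat: Pacemaker only evaluates expired failure-timeouts on cluster events and at the cluster-recheck-interval (15 minutes by default), so a short failure-timeout may take effect later than expected. A sketch of tightening that interval (cluster-wide property, value chosen here as an example):

```shell
# Lower the recheck interval so expired failure-timeouts are noticed
# sooner; the default is 15 minutes. This applies cluster-wide.
sudo pcs property set cluster-recheck-interval=2min
```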

More info: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/configuring_the_red_hat_high_availability_add-on_with_pacemaker/s1-resourceopts-haar