HA - Pacemaker - Is there a way to automatically clean up failed actions after X sec/min/hour?

Question

I'm using Pacemaker + Corosync on CentOS 7. When one of my resources fails or stops, I get a failed action message:

Master/Slave Set: myoptClone01 [myopt_data01]
     Masters: [ pcmk01-cr ]
     Slaves: [ pcmk02-cr ]
 myopt_fs01     (ocf::heartbeat:Filesystem):    Started pcmk01-cr
 myopt_VIP01    (ocf::heartbeat:IPaddr2):       Started pcmk01-cr
 ServicesResource        (ocf::heartbeat:RADviewServices):       Started pcmk01-cr

Failed Actions:
* ServicesResource_monitor_120000 on pcmk02-cr 'unknown error' (1): call=141, status=complete, exitreason='none',
    last-rc-change='Mon Jan 30 10:19:36 2017', queued=0ms, exec=142ms

Is there a way to automatically clean up the failed actions after X sec/min/hour?

Dok Dok · Accepted Answer · 2017-01-30T17:45:14

Look into the 'failure-timeout' resource option. It automatically cleans up the failed action if no further failures of that resource have occurred within the failure-timeout period.
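As a rough sketch, failure-timeout can be set as a meta attribute with pcs. The resource name ServicesResource is taken from the question; the 60s value is only an illustrative assumption, so pick whatever fits your environment.

```shell
# Assumption: 60s is an example value, not a recommendation.
# Sets failure-timeout as a meta attribute on the resource from the question.
pcs resource update ServicesResource meta failure-timeout=60s

# Show the resource configuration to confirm the meta attribute was stored.
# (This is the pcs 0.9 syntax shipped with CentOS 7; newer pcs releases
# use "pcs resource config" instead.)
pcs resource show ServicesResource
```

If you need the failed action gone immediately rather than after the timeout, `pcs resource cleanup ServicesResource` clears it by hand.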

I believe failure-timeout is only evaluated when the cluster rechecks its state, which happens every cluster-recheck-interval. This means that even with failure-timeout set to 1 minute, it may still take up to 15 minutes and 59 seconds to clear the failed action with Pacemaker's default 15-minute cluster-recheck-interval.
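If that delay is too long, one option is to lower cluster-recheck-interval, which is a cluster-wide property. A minimal sketch, assuming pcs on CentOS 7; the 2min value is an illustrative assumption:

```shell
# Assumption: 2min is an example value. This is a cluster-wide property,
# so it changes how often the whole cluster is re-evaluated, not just
# this one resource.
pcs property set cluster-recheck-interval=2min

# Confirm the property is set (pcs 0.9 syntax).
pcs property show cluster-recheck-interval
```

A shorter interval makes expired failures clear sooner, at the cost of the cluster recalculating its state more often.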

More information:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html
