I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script, and CloudWatch alarms to notify me whenever the script runs into problems.
The script puts a data point on a CloudWatch metric after every successful backup:
mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
To test this setup, after a day I stopped putting data into the metric (i.e., I commented out the mon-put-data command). As expected, the alarm eventually went to the ALARM state and I got an email notification.
The problem is that, some time later, the alarm went back to the OK state, even though no new data was being added to the metric!
The two transitions (OK => ALARM, then ALARM => OK) have been logged, and I reproduce the logs in this question. Note that, although both show "period: 21600" (i.e., 6 hours), the second one shows a 12-hour time span between startDate and queryDate. I can see that this might explain the transition, but I cannot understand why CloudWatch considers a 12-hour time span when calculating a statistic with a 6-hour period!
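To double-check the time spans, here is a small Python sketch that measures the distance between startDate and queryDate in units of the alarm period. It is not part of the backup script; the names `TRANSITIONS`, `parse`, and `span_in_periods` are made up for illustration, and the timestamps are copied from the newState entries of the two logs.

```python
from datetime import datetime

# (startDate, queryDate) pairs copied from the newState of each history item.
TRANSITIONS = {
    "OK -> ALARM": ("2013-03-06T09:12:00.000+0000", "2013-03-06T15:12:01.052+0000"),
    "ALARM -> OK": ("2013-03-06T05:46:00.000+0000", "2013-03-06T17:46:01.041+0000"),
}
PERIOD_SECONDS = 21600  # the "period" field in both logs (6 hours)


def parse(ts):
    """Parse a timestamp in the format used by the alarm history."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


def span_in_periods(start, query):
    """Return the startDate-to-queryDate span as a multiple of the period."""
    seconds = (parse(query) - parse(start)).total_seconds()
    return seconds / PERIOD_SECONDS


for name, (start, query) in TRANSITIONS.items():
    print(f"{name}: {span_in_periods(start, query):.2f} periods")
# OK -> ALARM: 1.00 periods
# ALARM -> OK: 2.00 periods
```

So the first transition looked at one 6-hour period, while the second looked at a window twice as long, even though both report the same period.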
What am I missing here? How can I configure the alarm to achieve what I want (i.e., get notified if backups are not being made)?
{
  "Timestamp": "2013-03-06T15:12:01.069Z",
  "HistoryItemType": "StateUpdate",
  "AlarmName": "alarm-backup-svn",
  "HistoryData": {
    "version": "1.0",
    "oldState": {
      "stateValue": "OK",
      "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-05T21:12:44.081+0000",
        "startDate": "2013-03-05T15:12:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          3
        ],
        "threshold": 3
      }
    },
    "newState": {
      "stateValue": "ALARM",
      "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-06T15:12:01.052+0000",
        "startDate": "2013-03-06T09:12:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          1
        ],
        "threshold": 2
      }
    }
  },
  "HistorySummary": "Alarm updated from OK to ALARM"
}
The second one, which I simply cannot understand:
{
  "Timestamp": "2013-03-06T17:46:01.063Z",
  "HistoryItemType": "StateUpdate",
  "AlarmName": "alarm-backup-svn",
  "HistoryData": {
    "version": "1.0",
    "oldState": {
      "stateValue": "ALARM",
      "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-06T15:12:01.052+0000",
        "startDate": "2013-03-06T09:12:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          1
        ],
        "threshold": 2
      }
    },
    "newState": {
      "stateValue": "OK",
      "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
      "stateReasonData": {
        "version": "1.0",
        "queryDate": "2013-03-06T17:46:01.041+0000",
        "startDate": "2013-03-06T05:46:00.000+0000",
        "statistic": "Sum",
        "period": 21600,
        "recentDatapoints": [
          3
        ],
        "threshold": 2
      }
    }
  },
  "HistorySummary": "Alarm updated from ALARM to OK"
}