7 votes

I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script, and CloudWatch Alarms to get notified whenever the script runs into problems.

The script puts a data point on a CloudWatch metric after every successful backup:

    mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1

I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
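
Such an alarm can be created with the legacy CloudWatch command line tools roughly like this (a sketch only; exact flag spellings may vary with the CLI version, and the SNS topic ARN is a placeholder):

    mon-put-metric-alarm --alarm-name alarm-backup-svn \
        --namespace Backup --metric-name $metric \
        --statistic Sum --period 21600 --evaluation-periods 1 \
        --threshold 2 --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-alerts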

To test this setup, after a day I stopped putting data into the metric (i.e., I commented out the mon-put-data command). As expected, the alarm eventually went to the ALARM state and I got an email notification.

The problem is that, some time later, the alarm went back to the OK state, even though no new data was being added to the metric!

The two transitions (OK => ALARM, then ALARM => OK) have been logged, and I reproduce the logs below. Note that, although both show "period": 21600 (i.e., 6 hours), the second one shows a 12-hour span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch would consider a 12-hour span when calculating a statistic with a 6-hour period!

What am I missing here? How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

    {
        "Timestamp": "2013-03-06T15:12:01.069Z",
        "HistoryItemType": "StateUpdate",
        "AlarmName": "alarm-backup-svn",
        "HistoryData": {
            "version": "1.0",
            "oldState": {
                "stateValue": "OK",
                "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
                "stateReasonData": {
                    "version": "1.0",
                    "queryDate": "2013-03-05T21:12:44.081+0000",
                    "startDate": "2013-03-05T15:12:00.000+0000",
                    "statistic": "Sum",
                    "period": 21600,
                    "recentDatapoints": [
                        3
                    ],
                    "threshold": 3
                }
            },
            "newState": {
                "stateValue": "ALARM",
                "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
                "stateReasonData": {
                    "version": "1.0",
                    "queryDate": "2013-03-06T15:12:01.052+0000",
                    "startDate": "2013-03-06T09:12:00.000+0000",
                    "statistic": "Sum",
                    "period": 21600,
                    "recentDatapoints": [
                        1
                    ],
                    "threshold": 2
                }
            }
        },
        "HistorySummary": "Alarm updated from OK to ALARM"
    }

The second one, which I simply cannot understand:

    {
        "Timestamp": "2013-03-06T17:46:01.063Z",
        "HistoryItemType": "StateUpdate",
        "AlarmName": "alarm-backup-svn",
        "HistoryData": {
            "version": "1.0",
            "oldState": {
                "stateValue": "ALARM",
                "stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
                "stateReasonData": {
                    "version": "1.0",
                    "queryDate": "2013-03-06T15:12:01.052+0000",
                    "startDate": "2013-03-06T09:12:00.000+0000",
                    "statistic": "Sum",
                    "period": 21600,
                    "recentDatapoints": [
                        1
                    ],
                    "threshold": 2
                }
            },
            "newState": {
                "stateValue": "OK",
                "stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
                "stateReasonData": {
                    "version": "1.0",
                    "queryDate": "2013-03-06T17:46:01.041+0000",
                    "startDate": "2013-03-06T05:46:00.000+0000",
                    "statistic": "Sum",
                    "period": 21600,
                    "recentDatapoints": [
                        3
                    ],
                    "threshold": 2
                }
            }
        },
        "HistorySummary": "Alarm updated from ALARM to OK"
    }

1 Answer

5 votes

This behavior (your alarm not transitioning into the INSUFFICIENT_DATA state) happens because CloudWatch accepts 'pre-timestamped' metric data points, so for a 6-hour alarm, if no data exists in the current open 6-hour window, it will take data from the previous 6-hour window (hence the 12-hour span you see above).

To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600 s) and increase the number of evaluation periods to however many failing periods you want to alarm on. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.
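
As a sketch with the same legacy mon-* tools as in your question, the alarm reconfigured along those lines could look like this (the six evaluation periods and the SNS topic ARN are illustrative choices, not requirements):

    mon-put-metric-alarm --alarm-name alarm-backup-svn \
        --namespace Backup --metric-name $metric \
        --statistic Sum --period 3600 --evaluation-periods 6 \
        --threshold 1 --comparison-operator LessThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:backup-alerts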

> How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?

A possible architecture for your alarm would be to publish 1 if your job succeeds and 0 if it fails. Then create an alarm with a threshold of < 1 over three 3600-second periods, meaning the alarm goes into ALARM if the job is running but failing. If you also set an INSUFFICIENT_DATA action on that alarm, you will additionally get notified if your job is not running at all.
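
A minimal sketch of the publishing side, where do_backup stands in for your real backup command (it is a placeholder, not something from your script):

    # Publish 1 on success and 0 on failure, so the metric has data
    # whenever the job runs at all.
    if do_backup; then
        mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
    else
        mon-put-data --namespace Backup --metric-name $metric --unit Count --value 0
    fi

The alarm from the sketch above would then use --threshold 1 --evaluation-periods 3 and an --insufficient-data-actions option pointing at the same SNS topic (flag name as I recall it from the legacy CLI), so that a completely silent job also triggers a notification.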

Hope that makes sense.