
I have a CloudFormation stack set up which creates an autoscaling group (ASG) along with some other items that aren't relevant.

There is an update policy on the ASG as follows:

    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'false'
      AutoScalingScheduledAction:
        IgnoreUnmodifiedGroupSizeProperties: 'true'
      AutoScalingRollingUpdate:
        MinInstancesInService: '0'
        MinSuccessfulInstancesPercent: '50'
        MaxBatchSize: '2'
        PauseTime: PT10M
        WaitOnResourceSignals: 'true'

As part of our release process we update the launch configuration in CloudFormation. This triggers the ASG to update, which is the desired behavior.

There is a lifecycle hook with a 600-second timeout set up to prevent the EC2 instance from going InService until a few checks pass. If these checks fail, I send an error signal back to CloudFormation and an ABANDON result to the lifecycle hook:

    # Signal failure for this resource back to CloudFormation
    /opt/aws/bin/cfn-signal -e 1 --stack ${AWS::StackId} --resource MyASG --region ${AWS::Region}

    # Look up this instance's ID, its ASG name, and the lifecycle hook name
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    ASG_NAME=$(aws ec2 describe-tags --region ${AWS::Region} --filters Name=resource-type,Values=instance Name=resource-id,Values=$INSTANCE_ID Name=key,Values='aws:autoscaling:groupName' | jq -r '.Tags[] | .Value')
    HOOK_NAME=$(aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name $ASG_NAME --region ${AWS::Region} | jq -r '.LifecycleHooks[0].LifecycleHookName')

    # Complete the hook with the result passed in as $1 (ABANDON on failure)
    aws autoscaling complete-lifecycle-action --lifecycle-hook-name $HOOK_NAME --auto-scaling-group-name $ASG_NAME --lifecycle-action-result $1 --instance-id $INSTANCE_ID --region ${AWS::Region}

This works, in that the EC2 instance is abandoned and terminated. The problem I'm having is that the ASG resource in the CloudFormation stack sits in UPDATE_IN_PROGRESS for an hour before it fails with a "Group did not stabilize" error and everything starts to roll back.

Since the PauseTime is set to "PT10M", I would expect it to wait at most 10 minutes and start rolling back as soon as the cfn-signal error signal is sent.

I'm unable to determine why the stack is waiting an hour. Any ideas here?


2 Answers


Considering your use case, you could remove the AutoScalingReplacingUpdate property from the ASG. As far as I know, AutoScalingReplacingUpdate and AutoScalingRollingUpdate are mutually exclusive for a given update; having both present might explain why the PT10M pause time is not taken into account.

Also, PauseTime is the upper time limit for a newly started instance to send its SUCCESS signal. I would give it some leeway, maybe a minute or two, so the ABANDON lifecycle action has time to complete.
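
For illustration, a policy along those lines might look like this (a sketch only; the PT12M value is a guess and should be tuned to how long your checks actually take):

    UpdatePolicy:
      AutoScalingScheduledAction:
        IgnoreUnmodifiedGroupSizeProperties: 'true'
      AutoScalingRollingUpdate:
        MinInstancesInService: '0'
        MinSuccessfulInstancesPercent: '50'
        MaxBatchSize: '2'
        PauseTime: PT12M
        WaitOnResourceSignals: 'true'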


I actually figured out my problem. The problem is that AWS apparently only considers an update a "rolling update" if the number of instances InService at the time the update starts is greater than 0 (or maybe equal to the number of desired instances).

The instances in this case were being shut down first to disconnect them from RDS while database changes were being made. As a result, AWS wasn't treating this as a rolling update; it was treating it as just a regular update. There doesn't appear to be any way to signal that a regular update has failed, or any way to change its timeout, so it just waited an hour.
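
That condition can be checked before kicking off the update. This is a sketch: `ASG_NAME` and `REGION` are placeholders, and jq is assumed to be installed.

```shell
# Count how many instances in the group are InService. CloudFormation
# only performs a rolling update when this is nonzero at update time.
count_in_service() {
  jq '[.AutoScalingGroups[0].Instances[]
       | select(.LifecycleState == "InService")] | length'
}

# Usage against the live group (not run here):
#   aws autoscaling describe-auto-scaling-groups \
#     --auto-scaling-group-names "$ASG_NAME" --region "$REGION" | count_in_service
```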

There are two possible solutions here:

  1. After doing the database maintenance, bring the instances back up before initiating the ASG update. It seems kind of silly to do that since the instances will just be shut down immediately again. Also it appears the rolling update tries multiple times and the timeout value resets each time.

  2. Have the cfn-init script do a "cancel-update-stack" on the stack. That is a rather hacky way of handling this case, but it should work.
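
A minimal sketch of option 2, assuming the stack name and region are passed in and the instance profile has the cloudformation:CancelUpdateStack permission (both are assumptions, not from my setup):

```shell
# Cancel the in-progress stack update from the failing instance.
# Set DRY_RUN=1 to print the command instead of calling AWS.
cancel_update() {
  local stack="$1" region="$2"
  ${DRY_RUN:+echo} aws cloudformation cancel-update-stack \
    --stack-name "$stack" --region "$region"
}
```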

Either way, the answer is that CloudFormation wasn't doing a rolling update, so the AutoScalingRollingUpdate block wasn't being used.