
I have a CloudFormation stack set up which creates an autoscaling group (ASG) along with some other items that aren't relevant.

There is an update policy on the ASG as follows:

    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'false'
      AutoScalingScheduledAction:
        IgnoreUnmodifiedGroupSizeProperties: 'true'
      AutoScalingRollingUpdate:
        MinInstancesInService: '0'
        MinSuccessfulInstancesPercent: '50'
        MaxBatchSize: '2'
        PauseTime: PT10M
        WaitOnResourceSignals: 'true'

As part of our release process we update the launch configuration in CloudFormation. This triggers the ASG to update, which is the desired behavior.

There is a lifecycle hook with a 600-second timeout set up to prevent the EC2 instance from going InService until a few checks pass. If these checks fail, I send an error signal back to CloudFormation and an ABANDON result to the lifecycle hook:

    # Signal failure for this resource back to CloudFormation
    /opt/aws/bin/cfn-signal -e 1 --stack ${AWS::StackId} --resource MyASG --region ${AWS::Region}

    # Look up this instance's ID, its ASG name, and the lifecycle hook name
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    ASG_NAME=$(aws ec2 describe-tags --region ${AWS::Region} --filters Name=resource-type,Values=instance Name=resource-id,Values=$INSTANCE_ID Name=key,Values='aws:autoscaling:groupName' | jq -r '.Tags[] | .Value')
    HOOK_NAME=$(aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name $ASG_NAME --region ${AWS::Region} | jq -r '.LifecycleHooks[0].LifecycleHookName')

    # Complete the hook with the result passed in as $1 (ABANDON on failure)
    aws autoscaling complete-lifecycle-action --lifecycle-hook-name $HOOK_NAME --auto-scaling-group-name $ASG_NAME --lifecycle-action-result $1 --instance-id $INSTANCE_ID --region ${AWS::Region}

This works, in that the EC2 instance is abandoned and terminated. The problem I'm having is that the ASG resource in the CloudFormation stack sits in UPDATE_IN_PROGRESS for an hour before it fails with a "Group did not stabilize" error and everything starts to roll back.

Since the PauseTime is set to "PT10M", I would expect it to wait at most 10 minutes and start rolling back as soon as the cfn-signal error signal is sent.

I'm unable to determine why the stack is waiting an hour. Any ideas here?


2 Answers


Considering your use case, you could remove the AutoScalingReplacingUpdate property from the ASG. As far as I know, AutoScalingReplacingUpdate and AutoScalingRollingUpdate are mutually exclusive for a given update; having both present might explain why the PT10M pause time is not taken into account.

Also, PauseTime is the upper time limit for a newly started instance to send its SUCCESS signal. I would give it some leeway, maybe a minute or two, so the ABANDON lifecycle action has time to complete.
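
For illustration, a policy along those lines might look like this (a sketch only; the PT12M value is a guess and should be tuned to how long your checks actually take):

    UpdatePolicy:
      AutoScalingScheduledAction:
        IgnoreUnmodifiedGroupSizeProperties: 'true'
      AutoScalingRollingUpdate:
        MinInstancesInService: '0'
        MinSuccessfulInstancesPercent: '50'
        MaxBatchSize: '2'
        PauseTime: PT12M
        WaitOnResourceSignals: 'true'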


I actually figured out my problem. The problem is that AWS apparently only considers an update a "rolling update" if the number of instances InService at the time the update starts is greater than 0 (or maybe equal to the number of desired instances).

The instances in this case were being shut down first to disconnect them from RDS while database changes were being made. As a result, AWS wasn't treating this as a rolling update; it was treating it as just a regular update. There doesn't appear to be any way to signal that a regular update has failed, or any way to change its timeout, so it just waited an hour.
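
That condition can be checked before kicking off the update. This is a sketch: `ASG_NAME` and `REGION` are placeholders, and jq is assumed to be installed.

```shell
# Count how many instances in the group are InService. CloudFormation
# only performs a rolling update when this is nonzero at update time.
count_in_service() {
  jq '[.AutoScalingGroups[0].Instances[]
       | select(.LifecycleState == "InService")] | length'
}

# Usage against the live group (not run here):
#   aws autoscaling describe-auto-scaling-groups \
#     --auto-scaling-group-names "$ASG_NAME" --region "$REGION" | count_in_service
```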

There are two possible solutions here:

  1. After doing the database maintenance, bring the instances back up before initiating the ASG update. It seems kind of silly to do that since the instances will just be shut down immediately again. Also it appears the rolling update tries multiple times and the timeout value resets each time.

  2. Have the cfn-init script do a "cancel-update-stack" on the stack. That is a rather hacky way of handling this case, but it should work.
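
A minimal sketch of option 2, assuming the stack name and region are passed in and the instance profile has the cloudformation:CancelUpdateStack permission (both are assumptions, not from my setup):

```shell
# Cancel the in-progress stack update from the failing instance.
# Set DRY_RUN=1 to print the command instead of calling AWS.
cancel_update() {
  local stack="$1" region="$2"
  ${DRY_RUN:+echo} aws cloudformation cancel-update-stack \
    --stack-name "$stack" --region "$region"
}
```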

Either way, the answer is that CloudFormation wasn't doing a rolling update, so the AutoScalingRollingUpdate block wasn't being used.