14
votes

I'm trying to get AutoScalingRollingUpdate to work on my Auto Scaling group: bring new instances online, then terminate the old instances only once the new instance(s) are accepting traffic. It seems like AutoScalingRollingUpdate is designed for exactly this purpose.

I have the HealthCheckType of my AutoScalingGroup set to 'ELB'. I also have the HealthCheck on the ELB set to require:

  • 3 successful requests to / for "healthy"
  • 10 unsuccessful requests to / for "unhealthy"
  • no grace period (zero, 0)

Now, from the ELB's perspective, when new instances come online, they are not InService for several minutes, which is what I expect. However, from the AutoScalingGroup's perspective, they are almost immediately considered InService, and as a result my AutoScalingGroup is taking healthy instances out of service before the new instances are actually ready to receive traffic. I'm confused about why the ASG thinks the instances are healthy before the ELB does, when HealthCheckType is explicitly set to 'ELB'.
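
For reference, the two views can be compared with the AWS CLI. This is just a sketch for checking the status by hand; the instance ID and load balancer name below are placeholders:

    # The ASG's view of the instance (HealthStatus plus LifecycleState)
    aws autoscaling describe-auto-scaling-instances \
        --instance-ids i-0123456789abcdef0 \
        --query "AutoScalingInstances[0].[HealthStatus,LifecycleState]"

    # The ELB's view of the same instance (InService / OutOfService)
    aws elb describe-instance-health \
        --load-balancer-name my-load-balancer \
        --instances i-0123456789abcdef0 \
        --query "InstanceStates[0].State"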

I've tried setting a grace period, but this doesn't change anything at all. In fact, I removed the grace period of 300 seconds because I thought maybe instances were implicitly "InService" during the grace period or something.

I know I can set a PauseTime on the rolling update policy, but that is fragile: sometimes instances fail as they come online and get nuked and replaced before they ever finish provisioning, so the PauseTime window may be exceeded. I'd also like to minimize the amount of time my app is running two different versions at the same time.

    ... ELB stuff ...

    "HealthCheck": {
      "HealthyThreshold": "3",
      "UnhealthyThreshold": "10",
      "Interval": "30",
      "Timeout": "15",
      "Target": {
        "Fn::Join": [
          "",
          [
            {"Fn::Join": [":", ["HTTP", {"Ref": "hostPort"}]]},
            {"Ref": "healthCheckPath"}
          ]
        ]
      }
    },

   ... ASG Stuff ...

  {
    ... snip ...

    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": "0",
    "Cooldown": "300"
  },
  "UpdatePolicy" : {
    "AutoScalingRollingUpdate" : {
      "MinInstancesInService" : "1",
      "MaxBatchSize" : "1"
    }
  }
Review your code; I think the issue is not in your AutoScalingGroup settings, it is in your ELB settings. `"HealthCheckGracePeriod": "0"` gives me a strange feeling; could you change it to 300? After that, the ELB will take care of availability, not the ASG. The ASG will scale up and down depending on ELB status. – BMW
Even with a grace period, the ASG considers the instance InService before the ELB does. It seems like a bug in CloudFormation to me. I actually set that time down to zero in an attempt to fix the problem. – d11wtq
Are you sure the Load Balancer reports the instance as being 'unhealthy'? Where do you see that status? The console sometimes is not updated immediately. Is the AWS CLI giving you the same status? What is the HTTP status code of your app while it starts? Does it return HTTP 200 OK? You can check this using 'curl -I ...'. – Sébastien Stormacq
Yes, I'm positive. The actual wording from the ELB is "OutOfService", while the ASG says "InService". The application is actually just a static website using Apache running inside Docker. The "several minutes" is just the time it takes to pull down the Docker image. During that time, port 80 isn't even open. – d11wtq

2 Answers

23
votes

First, in our experience with CloudFormation, the ASG HealthCheckType and HealthCheckGracePeriod are leveraged primarily outside the scope of CloudFormation events. These properties come into play any time a new instance is added to the ASG. This can be during a CloudFormation update, but also during Auto Scaling events or a self-healing event. In the latter cases it is important to set the HealthCheckGracePeriod to a value that gives a new instance sufficient time to come online before the ELB health checks are taken into account.

It seems the capability you are most interested in is the UpdatePolicy that is invoked when you run a CloudFormation update with a modified Launch Configuration. The magic property is WaitOnResourceSignals, which forces the ASG to wait for a success signal before considering the update a success.

  "UpdatePolicy" : {
    "AutoScalingRollingUpdate" : {
      "MinInstancesInService" : "1",
      "MaxBatchSize" : "1",
      "PauseTime" : "PT15M",
      "WaitOnResourceSignals" : "true"
    }
  },

When the WaitOnResourceSignals property is set to true, the PauseTime property becomes a timeout. If the ASG does not receive a signal within the 15-minute PauseTime, the update is considered a failure and the new instance is terminated. As soon as the ASG receives a success signal, the ASG health check comes into play (unless the HealthCheckGracePeriod has not yet expired, in which case it waits for the grace period to end). We typically set the HealthCheckGracePeriod to the same value as the PauseTime. This ensures that we never begin using the ELB health check before the instance has had a chance to send a signal or reach the PauseTime timeout. See http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
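
As a rough sketch of that alignment (assuming a PauseTime of PT15M; the 900 is simply fifteen minutes expressed in seconds, not a prescribed value), the relevant pieces would line up like this:

    "HealthCheckType" : "ELB",
    "HealthCheckGracePeriod" : "900",
    ...
    "UpdatePolicy" : {
      "AutoScalingRollingUpdate" : {
        "PauseTime" : "PT15M",
        "WaitOnResourceSignals" : "true"
      }
    }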

Typically, the success signal is sent to the ASG by cfn-signal, run from within the UserData of the ASG Launch Configuration right after the cfn-init bootstrapping script completes.

"UserData"       : { "Fn::Base64" : { "Fn::Join" : ["", [
     "#!/bin/bash -xe\n",
     "yum update -y aws-cfn-bootstrap\n",

     "/opt/aws/bin/cfn-init -v ",
     "         --stack ", { "Ref" : "AWS::StackName" },
     "         --resource LaunchConfig ",
     "         --configsets full_install ",
     "         --region ", { "Ref" : "AWS::Region" }, "\n",

     "/opt/aws/bin/cfn-signal -e $? ",
     "         --stack ", { "Ref" : "AWS::StackName" },
     "         --resource WebServerGroup ",
     "         --region ", { "Ref" : "AWS::Region" }, "\n"
]]}}

This is sufficient for many cases, but sometimes the instance may still not be ready when we send the success signal back to the ASG. For example, we may want to wait on a background process to load data or wait for our application server to start. This is especially true if our ELB health check targets a URL that requires our application to be running. In these cases we want to delay the success signal until our instance is ready. Here is an example of how to create a Launch Configuration configSet to delay the signal until the ELB API returns an "InService" status for the instance.

  "verify_instance_health" : {
    "commands" : {
      "ELBHealthCheck" : {
        "command" : { "Fn::Join" : ["", [ 
          "until [ \"$state\" == \"\\\"InService\\\"\" ]; do ",
          "  state=$(aws --region ", { "Ref" : "AWS::Region" }, " elb describe-instance-health ",
          "              --load-balancer-name ", { "Ref" : "ElasticLoadBalancer" }, 
          "              --instances $(curl -s http://169.254.169.254/latest/meta-data/instance-id) ",
          "              --query InstanceStates[0].State); ",
          "  sleep 10; ",
          "done"
        ]]}
      }
    }
  }
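
Note that for the describe-instance-health call above to succeed from the instance itself, the instance profile attached to the Launch Configuration needs permission for that API action. A minimal policy statement for this would be something along the lines of the following (Resource "*" is used here purely for brevity):

    {
      "Effect": "Allow",
      "Action": "elasticloadbalancing:DescribeInstanceHealth",
      "Resource": "*"
    }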

See this discussion forum for more information and a complete example using the ELB health check - https://forums.aws.amazon.com/ann.jspa?annID=2741

Note: These examples also require that you use the ASG CreationPolicy attribute to receive the signals during ASG creation. In the past, the WaitCondition and WaitConditionHandle resources were used to receive signals, but these are no longer recommended. The Count attribute is the number of signals that should be received at creation. This value should equal the ASG MinSize number.

  "CreationPolicy" : {
    "ResourceSignal" : {
      "Timeout" : "PT15M",
      "Count"   : "2"
    }
  },

http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html

4
votes

I realize this is a little late but perhaps it might save someone some time and effort.

If you are using elbv2, the command would look as follows. Please be aware of "=" versus "==", as this tripped me up for hours: Ubuntu 16 runs these commands with /bin/sh rather than /bin/bash, which means [ \"$state\" == \"\\\"healthy\\\"\" ] will never evaluate to true. At least, that is my understanding (see the short shell check after the snippet below).

"commands": {
  "ELBHealthCheck": {
    "command": {
      "Fn::Join": ["", [
        "until [ \"$state\" = \"\\\"healthy\\\"\" ]; do ",
        "state=$(aws elbv2 describe-target-health ",
        "--region ", {
          "Ref": "AWS::Region"
        },
        " ",
        "--target-group-arn ", {
          "Ref": "ELBRestPublicTargetGroup"
        },
        " ",
        "--targets Id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) ",
        "--query TargetHealthDescriptions[0].TargetHealth.State); ",
        "echo $(date): [$state] >> /tmp/health.log; ",
        "sleep 10; ",
        "done"
      ]]
    }
  }
}
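
To see the shell difference in isolation, here is a quick check (a sketch, assuming /bin/sh is dash, as on Ubuntu 16):

    # Under dash, the built-in [ does not support '==' and reports an
    # "unexpected operator" style error, so the test is never true:
    sh -c '[ "x" == "x" ] && echo matched'

    # The POSIX '=' comparison works under both dash and bash:
    sh -c '[ "x" = "x" ] && echo matched'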