4
votes

My App Service fails to scale in after scaling out. This is a pattern I've been trying to troubleshoot for several months.

I've tried the following but none have worked:

My scale condition is based on CPU and memory. However, I've never seen CPU go above 12%, so I'm assuming it's actually scaling on memory.

  1. Set the scale-out condition to memory over 90% on a 5-minute average with a 10-minute cooldown, and the scale-in condition to memory under 70% on a 5-minute average. This doesn't seem to make sense: if memory utilization is already at 90%, I probably have an underlying memory leak and should already have scaled out.

  2. Set the scale-out condition to memory over 80% on a 60-minute average with a 10-minute cooldown, and the scale-in condition to memory under 60% on a 5-minute average. This makes more sense, as I've seen memory usage burst for a few hours and then drop.

Expected behavior: App Service autoscaling reduces the instance count once memory usage has stayed below 60% for 5 minutes.

Question:

What is the ideal threshold on a metric for smooth scaling if my baseline CPU stays at an average of roughly 6% and memory at 53%? In other words, what are the best minimum values to scale in at and the best maximum values to scale out at, without running into anti-patterns such as flapping? A larger gap of 20% between the thresholds makes more sense to me.

Alternative solution:

Given the amount of troubleshooting involved in what's marketed as simple "push button scaling", the vagueness of the configuration makes it almost not worth the headache (you can't even use IIS metrics like connection count without a custom PowerShell script!). I'm considering disabling autoscaling because of its unpredictability, keeping 2 instances running for automatic load balancing, and scaling manually.

Autoscale Configuration:

{
    "location": "East US 2",
    "tags": {
        "$type": "Microsoft.WindowsAzure.Management.Common.Storage.CasePreservedDictionary, Microsoft.WindowsAzure.Management.Common.Storage"
    },
    "properties": {
        "name": "CPU and Memory Autoscale",
        "enabled": true,
        "targetResourceUri": "/redacted",
        "profiles": [
            {
                "name": "Auto created scale condition",
                "capacity": {
                    "minimum": "1",
                    "maximum": "10",
                    "default": "1"
                },
                "rules": [
                    {
                        "scaleAction": {
                            "direction": "Increase",
                            "type": "ChangeCount",
                            "value": "1",
                            "cooldown": "PT10M"
                        },
                        "metricTrigger": {
                            "metricName": "MemoryPercentage",
                            "metricNamespace": "",
                            "metricResourceUri": "/redacted",
                            "operator": "GreaterThanOrEqual",
                            "statistic": "Average",
                            "threshold": 80,
                            "timeAggregation": "Average",
                            "timeGrain": "PT1M",
                            "timeWindow": "PT1H"
                        }
                    },
                    {
                        "scaleAction": {
                            "direction": "Decrease",
                            "type": "ChangeCount",
                            "value": "1",
                            "cooldown": "PT5M"
                        },
                        "metricTrigger": {
                            "metricName": "MemoryPercentage",
                            "metricNamespace": "",
                            "metricResourceUri": "/redacted",
                            "operator": "LessThanOrEqual",
                            "statistic": "Average",
                            "threshold": 60,
                            "timeAggregation": "Average",
                            "timeGrain": "PT1M",
                            "timeWindow": "PT10M"
                        }
                    },
                    {
                        "scaleAction": {
                            "direction": "Increase",
                            "type": "ChangeCount",
                            "value": "1",
                            "cooldown": "PT5M"
                        },
                        "metricTrigger": {
                            "metricName": "CpuPercentage",
                            "metricNamespace": "",
                            "metricResourceUri": "/redacted",
                            "operator": "GreaterThanOrEqual",
                            "statistic": "Average",
                            "threshold": 60,
                            "timeAggregation": "Average",
                            "timeGrain": "PT1M",
                            "timeWindow": "PT1H"
                        }
                    },
                    {
                        "scaleAction": {
                            "direction": "Decrease",
                            "type": "ChangeCount",
                            "value": "1",
                            "cooldown": "PT5M"
                        },
                        "metricTrigger": {
                            "metricName": "CpuPercentage",
                            "metricNamespace": "",
                            "metricResourceUri": "/redacted",
                            "operator": "LessThanOrEqual",
                            "statistic": "Average",
                            "threshold": 40,
                            "timeAggregation": "Average",
                            "timeGrain": "PT1M",
                            "timeWindow": "PT10M"
                        }
                    }
                ]
            }
        ],
        "notifications": [
            {
                "operation": "Scale",
                "email": {
                    "sendToSubscriptionAdministrator": false,
                    "sendToSubscriptionCoAdministrators": false,
                    "customEmails": [
                        "redacted"
                    ]
                },
                "webhooks": []
            }
        ],
        "targetResourceLocation": "East US 2"
    },
    "id": "/redacted",
    "name": "CPU and Memory Autoscale",
    "type": "Microsoft.Insights/autoscaleSettings"
}
3 Answers

6
votes

For the CpuPercentage metric, you have a scale-out action when it goes above 60 and a scale-in action when it goes below 40, and the difference between the two is very small. This can cause a behavior known as flapping, which prevents autoscale's scale-in action from kicking in. The MemoryPercentage rule you have configured has a similar issue.

You should have a difference of at least 40 between your scale-out and scale-in thresholds to avoid flapping. More details on flapping are at https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-autoscale-best-practices#choose-the-thresholds-carefully-for-all-metric-types (search for the word Flapping)

2
votes

I have exactly the same problem, and I've come to believe that autoscaling back to one instance the way we want is currently not possible.

My current workaround is to scale in to 1 instance with a second profile that repeats every day between 23:55 and 00:00.

To reiterate the problem, I have the following scenario, which is basically identical to yours:

  • Memory baseline of the App Service is 50%
  • Scale out 1 instance when avg(memory) > 80%
  • Scale in 1 instance when avg(memory) < 60%

Scaling out from 1 instance to 2 instances will work correctly when the average memory percentage exceeds 80%. But scaling in to 1 instance will never work because the memory baseline is too high.

After reading the Best Practices, my understanding is that before scaling in, autoscale estimates the resulting memory percentage and checks that no scale-out rule would be triggered.

So if the average memory percentage drops to 50% across two instances, the scale-in rule is triggered, and autoscale estimates the resulting memory usage to be 2 * 50% / 1 = 100%. That would of course trigger the scale-out rule, so it does not scale in.

It should, however, work when scaling from 3 to 2 instances: 3 * 50% / 2 = 75%, which is below the 80% of the scale-out rule.
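The estimate described above can be sketched in a few lines of Python. This is only an illustration of my understanding, not Azure's actual implementation: the function names are made up, and it assumes the total load stays constant and is simply redistributed over the remaining instances.

```python
def projected_average(current_avg: float, instances: int) -> float:
    """Estimate the per-instance average after removing one instance.

    Assumption: the total load stays the same and is spread evenly
    over the remaining instances.
    """
    if instances < 2:
        raise ValueError("need at least two instances to scale in")
    return current_avg * instances / (instances - 1)


def scale_in_allowed(current_avg: float, instances: int,
                     scale_out_threshold: float) -> bool:
    """Hypothetical check: scale in only if the projected average
    stays below the scale-out threshold."""
    return projected_average(current_avg, instances) < scale_out_threshold


# 2 -> 1 instances at 50% memory: projected 100%, blocked by the 80% rule
print(projected_average(50, 2), scale_in_allowed(50, 2, 80))
# 3 -> 2 instances at 50% memory: projected 75%, allowed
print(projected_average(50, 3), scale_in_allowed(50, 3, 80))
```

With a 50% baseline and an 80% scale-out rule, the last step from 2 instances to 1 is the one that never passes this check.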

0
votes

I have the same issue here. My app needs only one instance, and I have an autoscaling configuration like this:

Scale out
When br-empresa (Average) CpuPercentage > 85 Increase instance count by 1
Or Br-Empresa (Average) MemoryPercentage > 85 Increase instance count by 1

Scale in
When br-empresa (Average) CpuPercentage <= 75 Decrease instance count by 1
And Br-Empresa (Average) MemoryPercentage <= 75 Decrease instance count by 1

And the baseline for memory is 60%.

The scale-out logic works well, but the app never scales in, even when memory falls to 60%, because the estimate is (60% * 2) / 1 = 120%.

For memory or CPU metrics, the current flapping estimate doesn't make sense.
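Turning the same arithmetic around gives the highest average at which a scale-in could still pass the estimate. This is a rough sketch assuming the estimate simply redistributes the current load over one fewer instance; the function name is hypothetical and this is not Azure's actual code.

```python
def max_average_for_scale_in(scale_out_threshold: float, instances: int) -> float:
    """Highest current per-instance average whose projection after
    removing one instance does not exceed the scale-out threshold.

    Assumption: projected = current_avg * instances / (instances - 1).
    """
    if instances < 2:
        raise ValueError("need at least two instances to scale in")
    return scale_out_threshold * (instances - 1) / instances


# With a scale-out rule at 85%, going from 2 instances to 1 would need
# the average to be at or below this value:
print(max_average_for_scale_in(85, 2))
```

Under that assumption, a 60% memory baseline can never satisfy the 2-to-1 step with an 85% scale-out rule, regardless of where the scale-in threshold is set.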