
I'm using the Azure Autoscale feature to process hundreds of files. The system scales out correctly to 8 instances, and each instance processes one file at a time.

The problem is with scaling in. The scale-in rules seem to be based on ALL instances, so if I tell it to reduce the instance count back to 1 once the average CPU load drops below 25%, it will arbitrarily kill instances that are still processing data.

Is there a way to prevent it from shutting down individual instances that are still in use?


1 Answer


Scaling in removes the highest instance numbers first. For example, if you have WorkerRole_IN_0, WorkerRole_IN_1, ..., WorkerRole_IN_8 and you scale in by 1, Azure will remove WorkerRole_IN_8 first. Azure has no idea what your code is doing (i.e., whether it is still processing a file or has finished and is ready to shut down).
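To make the removal order concrete, here is a small Python sketch that only models the rule described above. It does not call Azure; `instances_removed_on_scale_in` is a hypothetical helper, and it assumes instance names end in `_<number>`.

```python
def instances_removed_on_scale_in(instance_names, remove_count):
    """Return the instances that would go first, highest instance number first."""
    # Sort by the numeric suffix after the last underscore, descending.
    by_index = sorted(
        instance_names,
        key=lambda name: int(name.rsplit("_", 1)[1]),
        reverse=True,
    )
    return by_index[:remove_count]


names = [f"WorkerRole_IN_{i}" for i in range(9)]  # WorkerRole_IN_0 .. WorkerRole_IN_8
removed_one = instances_removed_on_scale_in(names, 1)
removed_three = instances_removed_on_scale_in(names, 3)
```

Note that the choice depends only on the instance number, never on whether that instance is busy.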

You have a few options:

  1. If the file processing is quick, you can delay the shutdown for up to 5 minutes in the OnStop event, giving your instance enough time to finish processing the file. This is the easiest solution to implement, but not the most reliable.
  2. If processing the file can be broken up into shorter chunks of work, you can have the instances process chunks until the file is complete. This way it doesn't really matter if an arbitrary instance is shut down, since you don't lose any significant amount of work and another instance will pick up where it left off. See https://docs.microsoft.com/en-us/azure/architecture/patterns/pipes-and-filters for a pattern. This is the ideal solution because it is an architecture optimized for distributed workloads, but some workloads (e.g., image/video processing) may not be easy to break up.
  3. You can implement your own autoscale algorithm and manually shut down the individual instances that you choose. To do this you would call the Delete Role Instance API (https://msdn.microsoft.com/en-us/library/azure/dn469418.aspx). This requires some external process to monitor your workload and execute management operations, so it may not be a good solution depending on your infrastructure.
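As a rough illustration of option 2, here is a minimal Python sketch of chunked processing with a checkpoint. Everything in it is an assumption for illustration: `process_file` is a hypothetical helper, the `checkpoint` dict stands in for durable storage (a queue message or blob, for example), and the `threading.Event` stands in for whatever shutdown signal your role receives. A real worker role would do the equivalent in its own language.

```python
import threading


def process_file(chunks, checkpoint, stop_requested, handle_chunk):
    """Process chunks from the saved position; return True once the file is complete."""
    index = checkpoint.get("next_chunk", 0)
    while index < len(chunks):
        if stop_requested.is_set():
            return False  # shutting down; everything finished so far is already saved
        handle_chunk(chunks[index])
        index += 1
        checkpoint["next_chunk"] = index  # record progress after every chunk
    return True


chunks = ["a", "b", "c", "d"]
checkpoint = {}            # stand-in for durable storage shared by all instances
stop = threading.Event()   # stand-in for the instance's shutdown signal
done = []


def handle_and_stop_after_two(chunk):
    # Simulate a scale-in arriving while the second chunk is being processed.
    done.append(chunk)
    if len(done) == 2:
        stop.set()


# First instance is interrupted partway through the file.
finished_first = process_file(chunks, checkpoint, stop, handle_and_stop_after_two)

# A second instance resumes from the checkpoint and completes the file.
finished_second = process_file(chunks, checkpoint, threading.Event(), done.append)
```

The stop flag is only checked between chunks, so the most work a removed instance can lose is one chunk. Keeping each chunk well under the shutdown window from option 1 lets the two approaches work together.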