1
votes

We have deployed our solution for high availability using Azure Traffic Manager with default settings.

The routing method we selected is Performance.

We expected that as soon as the primary server goes down, users would be transferred to the secondary server. Unfortunately there is a 30-second delay. In our testing, during those 30 seconds users found the site not responding and their requests timed out. It takes almost a minute for everything to work again.

[Screenshot: Azure Traffic Manager with 30 second TTL]

Generally we do not observe these dropouts on Facebook or Microsoft sites, which definitely maintain a high-availability solution.

Do we need to write code in our application to handle these dropouts gracefully, for example by showing a client-side dialog saying that we will be back soon? What would be the best solution so that the user experience is seamless?

2
Are you running multiple instances of your Web App? If so, is the Traffic Manager failover solution just to protect you in the event of a total Azure data center outage? - Rob Reagan
For failover, you want to configure the Priority routing method, not Performance. Also, do you have multiple instances of the Web App, or do you have multiple Web Apps hosting your site? - Chris Pietschmann

2 Answers

3
votes

Because Azure Traffic Manager is a DNS-based load balancer, the client has to wait for the TTL on the DNS entry to expire before it queries DNS again. That is why you are seeing this problem. Traffic Manager doesn't handle the traffic itself; it only decides, via DNS, which server your client will talk to.
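To make the TTL effect concrete, here is a minimal sketch of a caching resolver. It is a toy model, not how any real resolver or Traffic Manager is implemented: the `CachingResolver` class, the host names, and the fake clock are all made up for illustration, and it ignores the extra time Traffic Manager needs to detect that an endpoint is down.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class CachingResolver:
    """Toy client-side resolver that caches each answer for `ttl` seconds."""
    authoritative: Callable[[str], str]  # stands in for the DNS answer from Traffic Manager
    ttl: float                           # seconds an answer may be reused
    now: Callable[[], float]             # injectable clock so the demo is deterministic
    _cache: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def resolve(self, name: str) -> str:
        cached = self._cache.get(name)
        if cached is not None and self.now() < cached[1]:
            return cached[0]  # still within TTL: reuse the old answer
        address = self.authoritative(name)
        self._cache[name] = (address, self.now() + self.ttl)
        return address


# Hypothetical names, used only for this demo.
clock = [0.0]
healthy = {"endpoint": "primary.example.net"}
resolver = CachingResolver(
    authoritative=lambda name: healthy["endpoint"],
    ttl=30.0,
    now=lambda: clock[0],
)

first = resolver.resolve("myapp.trafficmanager.net")       # cached until t=30

clock[0] = 5.0
healthy["endpoint"] = "secondary.example.net"              # failover happens at t=5
during_ttl = resolver.resolve("myapp.trafficmanager.net")  # client still gets the old answer

clock[0] = 31.0
after_ttl = resolver.resolve("myapp.trafficmanager.net")   # cache expired, new answer

print(first, during_ttl, after_ttl)
```

Between t=5 and t=30 the client keeps sending requests to the dead primary even though DNS already points at the secondary, which matches the window of timeouts described in the question.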

Facebook and Microsoft are using a load balancer at a lower protocol level (for example, balancing behind a single IP address), so as soon as one node drops out the load balancer can switch to another, since it receives and redirects all traffic.

If you can switch to an Azure Load Balancer (not sure of the exact name), that will solve your problem. Otherwise you'd have to shorten your TTL or write code that flushes your DNS cache and tries again.

0
votes

The problem with Azure Load Balancer and Application Gateway is that they don't work across data centers; only Azure Traffic Manager does. The web application therefore has to handle the errors itself and retry after a while (long enough for the TTL to expire) so that it is redirected to the new server. https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview
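As a rough sketch of that "handle the error, wait out the TTL, retry" idea, something like the following could wrap each request. The function name `call_with_failover_retry`, its parameters, and the 30-second default are assumptions for illustration, not part of any Azure SDK. It also assumes the underlying client performs a fresh DNS lookup on the next attempt, which depends on your HTTP stack and OS cache.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_failover_retry(
    request: Callable[[], T],
    dns_ttl_seconds: float = 30.0,  # assumed to match the Traffic Manager profile TTL
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,  # injectable so it can be tested
) -> T:
    """Run `request`; on a connection failure, wait one TTL and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt < max_attempts:
                # Show a "reconnecting, back soon" message to the user here,
                # then wait long enough for the cached DNS answer to expire.
                sleep(dns_ttl_seconds)
    assert last_error is not None
    raise last_error


# Demo with a fake request that fails twice (primary down) and then succeeds.
calls = {"n": 0}


def flaky_request() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("primary not responding")
    return "ok from secondary"


waits = []
result = call_with_failover_retry(flaky_request, sleep=waits.append)
print(result, waits)
```

In a browser client the same pattern applies: catch the failed request, show a "we will be back soon" notice, and retry after roughly one TTL rather than immediately.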