6
votes

I am planning to use Azure Traffic manager to do a failover of my app running on one Azure zone to Azure zone. I need some suggestion, if that is the correct approach to do a failover ? We have seen issue with Azure that, most of the services in one region goes down for few hours. Although I understand that Azure traffic manager is not associated with the region. But is it possible that Azure traffic manager goes down or that traffic manager endpoint is not reachable although my backend webapp is reachable?

If I am planning to use Azure traffic manager, what are other problems I should be worried about ?

5

5 Answers

5
votes

I've been working with TM for some time now, so here are a few issues I haven't seen mentioned before:

If your service allows Keep-Alive, then your DNS entry will be ignored as long as the connection remains open. I've seen some exceptionally odd behavior result from this, including users being stuck on a fallback page since they kept using the connection, causing it to remain open indefinitely. If you have access to IIS Manager, you can force Keep-Alive to be false.

  • Browser DNS Caching

Most browsers have their own DNS cache, and very few honor DNS Time To Live. In my experience Chrome is pretty responsive, with IE and Edge having significant delays if you need them to rollover quickly. I've heard that Opera is particularly bad.

  • Other DNS Caching

Even if you're not accessing your service through a browser, other components can have DNS caches, and some of them will allow you to manage the cache yourself. This can in theory even depend on ISP's DNS caching, though reports on the magnitude of this vary significantly.

3
votes

Traffic Manager works at the DNS level, which itself is replicated. However, even then, you should still build in redundancy into your solution.

Take a look at the Azure Architecture Center under "Make all things redundant" and you will see a recommendation for Traffic Manager:

consider adding another traffic management solution as a failback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to point to the other traffic management service.

3
votes

The Traffic Manager internal architecture is resilient to the failure of any single Azure region. So, even if a region fails, Traffic Manager should stay up. That applies to all Traffic Manager components: control plane, endpoint monitoring, and DNS name servers.

Since Traffic Manager works at the DNS level, it doesn't have an 'endpoint' that proxies your traffic--it uses DNS to direct clients to the appropriate endpoint, and clients then connect to those endpoints directly. Thus, an unreachable endpoint is an application problem, not a Traffic Manager problem.

That said, if the Traffic Manager DNS name servers are down, you have a serious problem. You DNS resolution path will fail and your customers will be impacted. The only solution is to either accept the risk (small, but can never be zero) or have a plan in place to use another DNS system, either in parallel or failover. This is not a limitation of Traffic Manager; you could say the same about any DNS-based traffic management system.

The earlier answer from DornaDigital is very good (other than the first point which suggests DNS caching will protect you through a name server outage--it won't). It covers some important points. In short, DNS-based failover works well for new sessions. Existing clients may have to refresh or even close their browser and reconnect.

0
votes

I also agree with the details provided dornadigital.

There are considerations for front end applications as well. The browsers all have different thresholds for how long they maintain persistent connections. Chromium, for example, currently maintains a connection unless there is inactivity for 300 seconds.

In our web applications, we are detecting the failover by the presence of a certain number of failed requests to the endpoint. After requests begin failing, we pause requests for 301 seconds to allow the connection to reset. This allows the DNS change from the traffic manager to be applied to subsequent requests. We pop up a snackbar to indicate to the user that we are having an issue and display the count down when requests will resume. Similar to Gmail when it has an issue connecting to their servers.

I hope that gives you one idea on how to build some redundancy into your web apps.

0
votes

I disagree with Jonathan as his understanding of the resiliency of the Traffic Manager service is in disagreement with Microsoft's own documentation on the subject.

When you provision Azure Traffic Manager, you select a region in which to deploy the service. I (correctly) inferred this to assert if said region were to fail, the Traffic Manager service could also be impacted and in turn, your application solution would not properly fail over to the secondary region.

According to Microsoft's Azure Application Architecture Guide, under "Make all things redundant", a customer should deploy Traffic Manager into more than one region:

Include redundancy for Traffic Manager. Traffic Manager is a possible failure point. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a failback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to point to the other traffic management service.

Azure Application Architecture Guide - Make all things redundant

My thought and intention is to not deploy Traffic Manager within the primary service region, but instead to deploy it into the secondary (failover region) and a tertiary (3rd) region as a backup.