I have been having issues with Azure SQL maxing out DTU at random times with no change in query load. I generally see a big jump in data IO followed shortly after by a jump in CPU. The data IO then subsides, but CPU usage appears to get stuck and response times become so excessive that querying code starts to time out. The only way I have found to correct the problem is to scale up or down, let it settle, then scale back to the original setting.
We are running at the S4 size; however, an S2 would be sufficient except during what appear to be data center maintenance periods. As I mentioned, when it gets "stuck" I scale down to an S2, let it stabilize, and then scale back up to an S4, and everything returns to normal. We are also running 4 read-only replicas, and the code switches between the replicas when it detects a problem, which gives us time to do the scale trick and get things back to normal. This works well until the read/write primary goes sideways, and then there is nowhere to switch to.
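For reference, the scale trick above can also be done in T-SQL instead of the portal. A sketch, assuming a database named `MyDb` (the name is a placeholder; `SERVICE_OBJECTIVE` values are the standard Azure SQL tier names):

```sql
-- Request a scale-down to S2; the change is asynchronous and
-- completes in the background while the database stays online.
ALTER DATABASE [MyDb] MODIFY (SERVICE_OBJECTIVE = 'S2');

-- Check the service objective the database currently reports.
SELECT d.name, dso.edition, dso.service_objective
FROM sys.database_service_objectives AS dso
JOIN sys.databases AS d ON d.database_id = dso.database_id;
```

Scaling back up to S4 is the same statement with `'S4'`. Progress of an in-flight scale operation can be watched via `sys.dm_operation_status` in the `master` database.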
We have spent countless hours poring over best practices and working with Azure support, and at one point we were told that there would be a patch to take care of it. For a few months it did seem that something changed: it would get stuck for only about 15 minutes and then return to normal. But within the last month it's back to staying in this condition until we scale. During these periods I've wanted to restart the server, but scaling seems to be the next best thing.
[Image: Azure SQL 24-hour DTU graph — running normally, then the DTU jump where it gets stuck, then a return to normal after scaling]
Does anybody know what could be causing this, and what scaling actually does at the server level?
EDIT:
These events usually start with high (but not necessarily maxed) data I/O that then drops off, followed by high CPU usage that just keeps going with no other abnormal activity to account for it. One thing I just thought to mention, after reading the comments: when we move the load off a secondary experiencing this issue onto another, everything drops to zero on the first database, while the one we switch to only rises to our normal 5%-8% DTU utilization. If we then move the traffic back, the first one gets "stuck" again and the other drops back to its pre-switch utilization. It behaves as if the scale setting had been dropped down, but there is no indication in the portal that this happened.
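The "behaves as if the scale setting got dropped" observation can be checked from inside the database itself. `sys.dm_db_resource_stats` records one row per roughly 15 seconds (retained for about an hour), and its `dtu_limit` column shows the limit the engine is actually enforcing at that moment:

```sql
-- Recent resource usage for this database, newest first.
-- If dtu_limit drops below the S4 value during a "stuck" episode,
-- the effective tier really did change even if the portal shows S4.
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       dtu_limit
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```

Running this during an episode would also show whether CPU is genuinely pegged at the DTU cap or whether the percentages are being computed against a smaller limit.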
Regarding index rebuilds, we have automated code running in an Azure timer-triggered function that checks the fragmentation of a different index each night (in the early hours of the morning) and, if there is sufficient fragmentation, starts a rebuild. The longest rebuild runs for about an hour, and it takes about 17 days to work through all the indexes. Indexes that don't need a rebuild are skipped.
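The nightly check is of the kind sketched below (a sketch only; the table/index names and the 30% threshold are placeholders, not our exact code):

```sql
-- Find indexes above a fragmentation threshold; 'LIMITED' is the
-- cheapest scan mode of sys.dm_db_index_physical_stats.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
  AND i.name IS NOT NULL;

-- Rebuild one index online so readers are not blocked during the hour-long rebuild.
ALTER INDEX [IX_Example] ON [dbo].[ExampleTable] REBUILD WITH (ONLINE = ON);
```

The function just remembers which index it last checked and moves to the next one each night, which is why a full pass takes about 17 days.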