0
votes

I have been having issues with Azure SQL often maxing out DTU at random times with no change in query load. I generally see a big jump in data IO followed shortly after by a jump in CPU. Data IO then subsides, but CPU usage appears to get stuck and response times become so excessive that the querying code starts to time out. The only way I have found to correct the problem is to scale up or down, let it settle, then scale back to the original setting.

We are running at the S4 size; an S2 would be sufficient except during what appear to be data center maintenance periods. As I mentioned, when it gets "stuck" I scale down to an S2, let it stabilize, and then go back to an S4, and everything returns to normal. We are also running four read-only replicas, and the code switches between replicas when it detects a problem, which gives us time to do the scale trick and get things back to normal. This works well until the read/write database goes sideways, and then there is nowhere to switch to.

We have spent countless hours poring over best practices and working with Azure support, and at one point we were told that a patch would take care of it. For a few months it did seem they had done something: the database would get stuck for only about 15 minutes and then return to normal. But within the last month it is back to staying in this condition until we scale. During these periods I have wanted to restart the server, but scaling seems to be the next best thing.

[Image: Azure SQL 24-hour DTU graph — running normally, then a DTU jump that gets stuck, then a return to normal after scaling]

Does anybody know what could be the cause of this and what scaling really does at a server level?

EDIT:

These events usually start with high, but not necessarily maxed, data I/O that then drops off, followed by high CPU usage that just keeps going with no other abnormal activity to account for it. One thing I thought to mention after reading the comments: when we move the load off a secondary that is experiencing this issue onto another, everything drops to zero on the initial database, but the one we switch to only increases to our normal 5%-8% DTU utilization. If we then move the traffic back to the first one, it gets "stuck" again and the other drops back down to its pre-switch utilization. It behaves as if the scale setting had been dropped down, but there is no indication in the portal that this happened.

Regarding index rebuilds, we have automated code running in an Azure timer-triggered function that checks the fragmentation of a different index each night (in the early hours of the morning) and, if there is sufficient fragmentation, starts a rebuild. The longest rebuild runs for about an hour, and it takes about 17 days to work through all the indexes. Indexes that don't need to be rebuilt are skipped.
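For context, the nightly check is essentially of the following shape (a simplified sketch, not our exact code; the 30% threshold and the commented-out ALTER INDEX example are illustrative):

-- Fragmentation per index for the current database (sketch).
SELECT
    OBJECT_NAME(ips.object_id)       AS table_name,
    i.name                           AS index_name,
    ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30   -- illustrative rebuild threshold
  AND i.name IS NOT NULL                      -- skip heaps
ORDER BY ips.avg_fragmentation_in_percent DESC;

-- If the night's candidate index is fragmented enough, the function issues something like:
-- ALTER INDEX [IX_SomeIndex] ON [dbo].[SomeTable] REBUILD WITH (ONLINE = ON);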


2 Answers

0
votes

Normally this happens when there is a resource-intensive execution. Before scaling, if you haven't already, I would suggest checking for long-running queries from the portal and turning on automatic indexing. Similar charts also appear when an index rebuild is in progress (if you have such a maintenance process).
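As a starting point, something along these lines (a sketch, not the exact query behind the portal view) lists the currently executing requests, longest first:

-- Currently executing requests, longest-running first (times in milliseconds).
SELECT
    r.session_id,
    r.status,
    r.total_elapsed_time AS elapsed_ms,
    r.cpu_time           AS cpu_ms,
    r.logical_reads,
    t.text               AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.total_elapsed_time DESC;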

0
votes

Throttling is probably the cause of this issue. When throttling occurs you usually see the symptoms you described: connection timeouts and poor performance.

You can see connection timeouts with the following query (run it against the logical server's master database, where sys.event_log is exposed):

SELECT *
FROM sys.event_log
WHERE event_type <> 'connection_successful'
  AND start_time >= CAST(FLOOR(CAST(GETDATE() AS float)) AS datetime)  -- since midnight today
ORDER BY start_time DESC;

The following query helps you decide when to scale up: it computes, over the window kept by sys.dm_db_resource_stats, the share of samples in which each resource stayed below 80% utilization.

SELECT
    (COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent       > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent',
    (COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent',
    (COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent   > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats;
-- If a fit percent falls below your service level objective (e.g. 99.9%), go to the next tier.

When avg_log_write_percent is at or near 100%, throttling occurs. Consider scaling up to a Premium tier before starting IO-intensive workloads.
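To watch for this, you can also look at the most recent samples directly; sys.dm_db_resource_stats keeps roughly the last hour of data at 15-second granularity. A minimal sketch:

-- Most recent resource usage samples; watch for avg_log_write_percent near 100.
SELECT TOP (20)
    end_time,
    avg_cpu_percent,
    avg_data_io_percent,
    avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;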

Try implementing batching in your IO workloads to control those DTU spikes. Please read this documentation to learn how.
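As one example of the batching idea on the database side, a large delete or update can be broken into small chunks so that log writes are spread out over time. A sketch, with a hypothetical table name and an illustrative batch size:

-- Sketch: delete in small batches rather than one large statement,
-- spreading log-write and data-IO usage over time (names and sizes illustrative).
DECLARE @BatchSize int = 1000;
DECLARE @RowsAffected int = 1;

WHILE @RowsAffected > 0
BEGIN
    -- dbo.StagingRows and the 30-day cutoff are hypothetical.
    DELETE TOP (@BatchSize)
    FROM dbo.StagingRows
    WHERE processed_date < DATEADD(DAY, -30, GETUTCDATE());

    SET @RowsAffected = @@ROWCOUNT;

    WAITFOR DELAY '00:00:01';  -- brief pause so other work can proceed between batches
END;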