7
votes

We have an Azure SQL database. Up until a few weeks ago, we were set at 10 DTUs (S0). Recently, we've gotten more SQL timeout errors, prompting us to increase our DTUs to 50 (S2). We get the errors less frequently, but still on occasion. When we get these timeouts, we see spikes on the Resource graph hitting 100%. Drilling into that, it's generally Data I/O operations that are making it spike. But when we check Query Performance Insight, none of the listed queries show that they're using that much resources.

Another thing to note is that our database has grown steadily in size. It is now about 19 GB, and the majority (18 GB) of that comes from one table that has a lot of long JSON strings in it. The timeout errors generally do happen on a certain query that has several joins, but they do not interact with the table with the long strings.

We tested making a copy of the database and removing all the long strings, and it did not get any timeouts at 10 DTU, but performed the same as the database with all the long strings at 50 DTU as far as load times.

We have rebuilt our indexes and, though it helped, we continue to experience timeout errors.

Given that the query that gets timeouts is not touching the table with long strings, could the table with long strings still be the culprit for DTU usage? Would it have to do with SQL caching? Could the long strings be hogging the cache and causing a lot of data I/O? (They are accessed fairly frequently too.)

1

1 Answers

1
votes

The strings can definitely exhaust your cache budget if they are hot (as you say they are). When the hot working set exceeds RAM cache size performance can fall off a cliff (10-100x). That's because IO is 10-1000x slower than RAM access. This means that even a tiny decrease in cache hit ratio (such as 1%) can multiply into a big performance loss.

This cliff can be very steep. One moment the app is fine, the next moment IO is off the charts.

Since Azure SQL Database has strict resource limits (as I hear and read) this can quickly exhaust the performance that you bought bring on throttling.

I think the test you made kind of confirms that the strings are causing the problem. Can you try to segregate the strings somewhere else? If they are cold move them to another table. If they are hot move them to another datastore (database or NoSQL). That way you can likely move back to a much lower tier.