Today, for about 10 minutes and for no reason I can identify, one of our Azure web apps had a meltdown related to the availability of its Azure SQL database. The web app showed a YSOD that said "the wait operation timed out." Elmah logged other errors, such as "login failed" and "A severe error occurred on the current command. The results, if any, should be discarded."
Looking at the app and database load, I don't really see why. The database was at the S0 level and we have since increased it to S1, but even needing to do that seems strange. We rarely see more than 50% DTU utilization on the database, and the web app peaks at about 35 simultaneous requests. Overall, it just doesn't seem like a load that should exceed the capabilities of an S0 database.
All that sucks, yeah, but the biggest question is: how do I even troubleshoot this? It looks like a DB issue, but given the low load I don't know what caused it. I certainly don't want to upgrade to a Premium level at $300+ per month for an app of this size.
Is there logging I can set up to figure this out? Some way to look back at what happened and draw definitive conclusions on how to prevent it from happening again?
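To make the question concrete, here is a rough sketch of the kind of app-side logging I have in mind: wrap each database call, log how long it took and what error it raised, and retry when the error looks transient. It is written in Python only to keep it short (the real app is ASP.NET), and every name in it (`call_with_diagnostics`, `is_transient`, `TRANSIENT_MARKERS`) is made up for illustration. Treating the three error messages quoted above as transient is my assumption, not something I have confirmed.

```python
import logging
import time

logger = logging.getLogger("db_diagnostics")

# Substrings of the errors quoted above. Treating them as transient
# (i.e. worth retrying) is an assumption, not a confirmed diagnosis.
TRANSIENT_MARKERS = (
    "wait operation timed out",
    "login failed",
    "severe error occurred on the current command",
)


def is_transient(exc):
    """Return True if the exception message matches a known marker."""
    message = str(exc).lower()
    return any(marker in message for marker in TRANSIENT_MARKERS)


def call_with_diagnostics(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(), logging its duration and retrying transient-looking errors.

    `operation` is any zero-argument callable that performs the database work.
    The delay doubles after each failed attempt (0.5s, 1.0s, ...).
    """
    for attempt in range(1, attempts + 1):
        started = time.monotonic()
        try:
            result = operation()
            logger.info("db call ok attempt=%d elapsed=%.3fs",
                        attempt, time.monotonic() - started)
            return result
        except Exception as exc:
            logger.warning("db call failed attempt=%d elapsed=%.3fs error=%s",
                           attempt, time.monotonic() - started, exc)
            # Give up on the last attempt or on errors that do not look transient.
            if attempt == attempts or not is_transient(exc):
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Timestamped log lines like these could at least be lined up against the DTU graph after the fact, but they only show what the app saw, not what the database was doing, which is the part I don't know how to get at.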