I have a fairly high-load deployment on Azure: 4 Large instances serving about 300-600 requests per second. Under normal conditions the "Average Response Time" is 70 to 150 ms. It sometimes grows to 200-300 ms, which is perfectly acceptable.
However, once or twice a day (not during rush hours) I see a picture like this on the Web Site Monitoring tab:
The number of requests per minute drops significantly, the average response time grows to as much as 3 minutes, and after a while everything returns to normal.
During this "blackout" only 0.1% of requests are dropped (HTTP server errors with a timeout); the other requests just wait in the queue and are processed normally after a few minutes. However, not all clients are willing to wait :-(
Memory usage is under 30% all the time, CPU usage is only up to 40-50%.
What I have already checked:
- Traces for timed-out requests: they timed out at random locations.
- Throttling for Azure Storage and other components used: no throttling at all.
- Routing all traffic through CloudFlare: the same problems appeared.
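To pin down exactly when each blackout starts and ends, the per-request logs could be bucketed by minute and the slow minutes flagged, so they can be lined up against deployment, recycling, or platform events. Below is a minimal sketch in Python for offline analysis. It assumes the logs have already been parsed into `(timestamp, duration_ms)` pairs; the function name `find_slow_minutes` and the 1000 ms threshold are hypothetical choices for illustration, not part of any Azure tooling.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_slow_minutes(records, threshold_ms=1000.0):
    """Return (minute, request_count, avg_duration_ms) for every minute
    whose average request duration exceeds threshold_ms.

    records: iterable of (datetime, duration_ms) pairs (assumed input shape).
    """
    # Group request durations by the minute in which the request was logged.
    buckets = defaultdict(list)
    for timestamp, duration_ms in records:
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[minute].append(duration_ms)

    # Keep only the minutes whose average duration is above the threshold.
    slow = []
    for minute in sorted(buckets):
        durations = buckets[minute]
        average = sum(durations) / len(durations)
        if average > threshold_ms:
            slow.append((minute, len(durations), average))
    return slow
```

A minute that shows both a low request count and a very high average duration would match the pattern described above (throughput drops while response time climbs to minutes).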
What could be the reason for such problems? What should I check next?
Thank you all in advance!
Update 1: BenV proposed a good thing to try, but unfortunately it showed nothing :-(
I configured process recycling every 500k requests and also added worker nodes, so CPU utilization is now below 40% all day long, but the blackouts still appear.
Update 2: The project uses ASP.NET MVC 4.