We have an ASP.NET web application deployed to Azure App Service. It uses Application Insights for logging and New Relic for monitoring.
I often investigate slow response times, and the hardest part is identifying the root cause.
In New Relic, I can see all the endpoints got slower:
But the likely cause is a single endpoint that was hit by expensive requests, which caused a CPU utilization spike that shows up as slow response times for every endpoint.
Sometimes it's pretty clear: one endpoint gets a burst of traffic, so it stands out. But often it's not about throughput; it's about what those requests look like.
Are there established analytical or statistical methods for finding the root cause in cases like this? I imagine it might involve capturing a profiler snapshot of the running application, analyzing the web server logs, etc.
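
To make the kind of log analysis I have in mind concrete, here is a minimal sketch in Python. It assumes the request logs can be exported as (endpoint, duration in ms) pairs for a baseline window and for the slow window. The endpoint names, the numbers, and the helper functions `time_share` and `rank_by_share_change` are all made up for illustration. The idea is that if every endpoint slows down by roughly the same factor, each endpoint's share of total request time stays about the same, so an endpoint whose share grows is a candidate culprit. Wall-clock duration is only a rough proxy for CPU cost, so this is a starting point rather than a proper method.

```python
from collections import defaultdict


def time_share(requests):
    """Return each endpoint's share of the total request time.

    `requests` is an iterable of (endpoint, duration_ms) pairs.
    """
    totals = defaultdict(float)
    for endpoint, duration_ms in requests:
        totals[endpoint] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {ep: total / grand_total for ep, total in totals.items()}


def rank_by_share_change(baseline, incident):
    """Rank endpoints by how much their share of total time grew."""
    base = time_share(baseline)
    inc = time_share(incident)
    endpoints = set(base) | set(inc)
    changes = {ep: inc.get(ep, 0.0) - base.get(ep, 0.0) for ep in endpoints}
    # Largest increase first: these endpoints are the first to inspect.
    return sorted(changes.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample data: every endpoint got slower, but /report
# grew far more than the others.
baseline = [("/orders", 100), ("/search", 100), ("/report", 200)]
incident = [("/orders", 200), ("/search", 200), ("/report", 1600)]

ranking = rank_by_share_change(baseline, incident)
for endpoint, change in ranking:
    print(f"{endpoint}: {change:+.2f}")
```

In this made-up data all three endpoints are slower in absolute terms, but only `/report` increases its share of total time, so it ends up at the top of the ranking.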