We are using Prometheus and Grafana for our monitoring and we have a panel for response time however I noticed after while the metrics are missing and there are a lots of gap in the panel (only for response time panel) and they comeback as soon as I restart the app (redeploying it in openshift). the service has been written in Go and the logic for the gathering response time is quite simple.
we declared the metric
var (
responseTime = promauto.NewSummaryVec(prometheus.SummaryOpts{
Namespace: "app",
Subsystem: "rest",
Name: "response_time",
}, []string{
"path",
"code",
"method",
})
)
and fill it in our handler
func handler(.......) {
start := time.Now()
// do stuff
....
code := "200"
path := r.URL.Path
method := r.Method
elapsed := float64(time.Since(start)) / float64(time.Second)
responseTime.WithLabelValues(path, code, method).Observe(elapsed)
}
and query in the Grafana panel is like:
sum(rate(app_rest_response_time_sum{path='/v4/content'}[5m]) /
rate(app_rest_response_time_count{path='/v4/content'}[5m])) by (path)
can anyone explain what do we do wrong or how to fix this issue? is it possible that we facing some kind of overflow issue (the average RPS is about 250)? I'm suspecting this because this happen more often to the routes with higher RPS and response time!