0
votes

We are using Prometheus and Grafana for our monitoring and we have a panel for response time. However, I noticed that after a while the metrics go missing and there are a lot of gaps in the panel (only in the response time panel), and they come back as soon as I restart the app (redeploy it in OpenShift). The service is written in Go and the logic for gathering the response time is quite simple.

We declared the metric:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Response time in seconds, labelled by request path, status code and HTTP method.
    responseTime = promauto.NewSummaryVec(prometheus.SummaryOpts{
        Namespace: "app",
        Subsystem: "rest",
        Name:      "response_time",
    }, []string{
        "path",
        "code",
        "method",
    })
)
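
The metric is registered with the default registry by promauto and exposed over HTTP. The wiring below is only a minimal sketch; the port and the /metrics route are assumptions, not necessarily our exact setup.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Minimal sketch of the metrics endpoint; the port and route are placeholders.
func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}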

and we record observations in our handler:


func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // do stuff
    // ...

    code := "200"
    path := r.URL.Path
    method := r.Method
    elapsed := float64(time.Since(start)) / float64(time.Second)
    responseTime.WithLabelValues(path, code, method).Observe(elapsed)
}
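
As an aside, the elapsed-seconds calculation is equivalent to time.Since(start).Seconds(), and the client library also ships prometheus.NewTimer for the same job. The sketch below is only an illustration (it reuses the imports and the responseTime metric from the snippets above); the handler name and the fixed "200" status code are placeholders, since in our real handler the code is only known at the end.

// Equivalent timing with the client library's Timer helper. The Observer
// (and therefore the label values) must be known up front, so this fits
// best when the status code is fixed or resolved via a ResponseWriter
// wrapper; "200" here is just a placeholder.
func timedHandler(w http.ResponseWriter, r *http.Request) {
    timer := prometheus.NewTimer(responseTime.WithLabelValues(r.URL.Path, "200", r.Method))
    defer timer.ObserveDuration()

    // do stuff
}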

and the query in the Grafana panel is:

sum(rate(app_rest_response_time_sum{path='/v4/content'}[5m]) / 
rate(app_rest_response_time_count{path='/v4/content'}[5m])) by (path)

but the result looks like this: [screenshot of the Grafana panel, showing gaps in the response time graph]

Can anyone explain what we are doing wrong or how to fix this issue? Is it possible that we are facing some kind of overflow issue (the average RPS is about 250)? I suspect this because it happens more often on the routes with higher RPS and response time!

1
Do you see the same lines when you run the query in Prometheus? - Tomy8s
No, the graph is the same in Prometheus and Grafana - Kian Ostad

1 Answer

0
votes

Prometheus normally records metrics continuously, and when you query it, it returns all the samples it collected for the time range you queried.

If there is no data when you query, there are typically three reasons:

  • The time series was not there: this happens when the instance restarts, you have a dynamic set of labels, and there has been no request yet for the label value you queried (in your case, no request for path='/v4/content' since the restart). In that case you should still see other metrics of the same job (at least up). See the sketch below this list for a way to pre-initialize the label values.
  • Prometheus had problems storing the metrics (check the Prometheus log files for that timeframe).
  • Prometheus was down for that timeframe and therefore did not collect any metrics (in that case you should have no metrics at all for that timeframe).
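
If the first reason applies, one common mitigation is to create the label combinations you care about at startup, so the series are exported (with a count of 0) immediately after a restart instead of only after the first request. A minimal sketch in Go, assuming your existing responseTime SummaryVec; the paths, codes and methods listed are placeholders:

// Pre-initialize the label combinations at startup so the series exist
// right after a restart. The values below are placeholders.
func init() {
    for _, path := range []string{"/v4/content"} {
        for _, method := range []string{"GET", "POST"} {
            responseTime.WithLabelValues(path, "200", method)
        }
    }
}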