0
votes

I have a question about the PromQL function rate() and how to use it properly. In my application I have a thread running, and I use Micrometer's Timer to monitor the thread's runtime. A Timer gives you a counter with the suffix _count and another counter, with the suffix _sum, that holds the total seconds spent, e.g. my_metric_count and my_metric_sum.

My raw data looks like this (scrape interval 30 s, range vector 5m):

[Image: raw data samples, not reproduced here]

According to the docs (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate), rate() calculates the per-second average rate of increase of the time series in the range vector (which is 5m here).

Now my question is: why would I want that? The relative change of my execution runtime seems pretty useless to me. In fact, just using sum/count looks more useful, as it gives me the average absolute duration at each moment in time. At the same time, and this is what confuses me, in the docs I find:

To calculate the average request duration during the last 5 minutes from a histogram or summary called http_request_duration_seconds, use the following expression:

rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Source: https://prometheus.io/docs/practices/histograms/

But as I understand the docs, this expression would calculate the per-second average rate of increase of the request duration, i.e. not how long a request takes on average, but how much the request duration has changed on average over the last 5 minutes.


2 Answers

1
votes

While I am not familiar with Micrometer's Timer, the metric you're describing is of type Summary. It counts the "events" in _count and sums the events' magnitude (duration, elapsed time and similar) in _sum. If you run rate(metric_count[5m]), you get the per-second rate of your events, averaged over 5m. If you want to know the average duration of these events within the 5m window, you use rate(metric_sum[5m]) / rate(metric_count[5m]). If you simply divide metric_sum / metric_count, you get the all-time average (since the last counter reset) instead of the 5m average at a given point in time.

In a way, it looks a bit funny to use rate() for this. Using increase() seems more intuitive to me, but mathematically it is exactly the same: rate() is just increase() divided by the range, so the ranges cancel each other out in rate(metric_sum[5m]) / rate(metric_count[5m]).
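To make the difference between the windowed average and the all-time average concrete, here is a small Python sketch. It does not talk to Prometheus: the samples are made up (a hypothetical timer whose runs take 1 s for the first 15 minutes and 4 s afterwards), and simple_rate is a simplified stand-in for rate() that skips the extrapolation and counter-reset handling the real function performs.

```python
SCRAPE_INTERVAL = 30  # seconds, as in the question


def build_samples():
    """Return (timestamps, count_samples, sum_samples) for a hypothetical timer."""
    timestamps, counts, sums = [], [], []
    count, total = 0, 0.0
    for step in range(41):  # t = 0 .. 1200 s
        t = step * SCRAPE_INTERVAL
        if step > 0:
            duration = 1.0 if t <= 900 else 4.0  # runs get slower after 15 min
            count += 2             # two runs per scrape interval
            total += 2 * duration  # seconds spent in those two runs
        timestamps.append(t)
        counts.append(count)
        sums.append(total)
    return timestamps, counts, sums


def simple_rate(timestamps, values, window, now):
    """Per-second increase between the first and last sample in the window.

    Simplification: no extrapolation and no counter-reset handling,
    both of which the real rate() performs.
    """
    points = [(t, v) for t, v in zip(timestamps, values) if now - window <= t <= now]
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


ts, counts, sums = build_samples()
now, window = 1200, 300  # evaluate at t = 20 min with a [5m] window

windowed_avg = simple_rate(ts, sums, window, now) / simple_rate(ts, counts, window, now)
all_time_avg = sums[-1] / counts[-1]

print("average duration over the last 5m:", round(windowed_avg, 3))
print("average duration since start:     ", round(all_time_avg, 3))
```

With these made-up numbers the windowed expression reports roughly 4 s, which is what the runs currently take, while sum/count reports 1.75 s because it is still dominated by the earlier, faster runs.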

0
votes

First of all - use the tool that matches your use case.

Second - whatever you choose, validate the data. And better do it now than during an outage or with an angry customer/user.

Third - _sum and _count are features of both histograms and summaries (_bucket exists only for histograms). The scrape interval doesn't really matter here, as long as it is shorter than the [5m] range of the rate() function, so that the window contains at least two samples.

The rate simply gives you data points of "how many occurrences happened per second, averaged over these five minutes ([5m])".
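As a rough illustration of the units involved, the sketch below compares the total growth of a counter over a 5m window with the per-second average of that growth. The samples are hypothetical (three new runs per 30 s scrape), and both helper functions are simplified stand-ins for increase() and rate() that ignore extrapolation and counter resets.

```python
WINDOW = 300  # seconds, the [5m] range

# Hypothetical my_metric_count samples: one every 30 s, 3 new runs per scrape.
samples = [(t, 3 * (t // 30)) for t in range(0, WINDOW + 1, 30)]


def simple_increase(samples):
    """Counter growth between the first and last sample in the window."""
    return samples[-1][1] - samples[0][1]


def simple_rate(samples):
    """The same growth divided by the elapsed seconds: a per-second average."""
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


print("runs during the window:", simple_increase(samples))
print("runs per second:       ", simple_rate(samples))
```

The first number is a count over the whole window, the second is that count spread evenly over each second of the window.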

General note - the rate() concept in Prometheus causes a lot of confusion and is debated by many people. They should probably have called it something else.