
We use Grafana + Prometheus to monitor our infrastructure. We recently added some business-focused metrics, and I've been having issues with one of the counters we track: a session time counter. Each time a session ends, we increase that counter by the time the user spent in that session, so if a user spends 2 minutes using the software, the counter is incremented by 120000 ms. That approach worked perfectly fine for a few days. Yesterday, however, there was a big discrepancy between one instance's counter and the rest of them, and that big counter was then reset because part of the service was restarted. Since then I can't get a meaningful singlestat panel.

Here's a graph of what happened (this counter has 3 labels, which result in more than 50 label combinations):

Prometheus graph

The current all-time total tracked by this counter is 13.8 years for a 4-day period, but since the counter reset, my singlestat values for a 24h period have been either -20 years (using diff) or 35 years (using range). This is not wrong if you don't account for the counter reset, since diff and range only look at the min/max/first/current values, but it's not a useful metric anymore.

If I set the timeframe so that it does not include the counter reset, both Diff and Range show values very close to what is expected (our usage is very linear and predictable).

The singlestat panel formula looks like this:

sum(dyno_app_music_total_user_listen_time{server=~"[[server]]", clusterId=~"[[clusterid]]"})

How can I handle resets in a counter for a singlestat metric?

1 Answer

I'm not sure I fully understand your question, but as I understand it, you have a metric with 3 labels (resulting in more than 50 different time series) and you want to display a singlestat panel that sums all those counters together across all of time.

The way to handle counter resets in Prometheus is to use rate() or, if you want an absolute value, increase(). So the way you would write your query (assuming you want the sum of counter increases for all time) is:

sum(increase(dyno_app_music_total_user_listen_time{...}[100y]))
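To illustrate why this helps, here is a rough Python sketch of the reset handling, compared with what diff and range compute. It is not Prometheus code: the function names and sample values are made up for the example, and the real increase() additionally extrapolates to the edges of the time window, which this sketch leaves out. Note that the reset correction is applied per series and only then summed, which is why the query is sum(increase(...)) rather than increase(sum(...)).

```python
def naive_diff(samples):
    """What a 'diff' stat does: last value minus first value."""
    return samples[-1] - samples[0]


def naive_range(samples):
    """What a 'range' stat does: max value minus min value."""
    return max(samples) - min(samples)


def reset_aware_increase(samples):
    """Simplified model of counter-reset handling (no extrapolation).

    Any drop between consecutive samples is treated as a restart from
    zero, so the whole post-reset value counts as new increase.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical samples for two series; series_a resets after reaching 400.
series_a = [100, 250, 400, 30, 90]
series_b = [10, 20, 40]

print(naive_diff(series_a))            # negative, like the -20 years stat
print(naive_range(series_a))           # misleading: ignores the reset
print(reset_aware_increase(series_a))  # 150 + 150 + 30 + 60

# Correct each series for resets first, then sum across series.
total = sum(reset_aware_increase(s) for s in (series_a, series_b))
print(total)
```

With these made-up numbers, diff is negative and range misses the growth on either side of the reset, while the reset-aware version recovers the full increase.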

Do note, however, that this is going to get slower and slower over time, because Prometheus has to go back and load all of your 50+ time series for all time before doing the calculation. Eventually the number of samples loaded will exceed either the limit configured in Prometheus or the amount of memory available.

What may be more useful than that (and would over time get rid of the spike you experienced "yesterday") is to instead show a graph of the rate of change of your counters over some much shorter time range:

sum(rate(dyno_app_music_total_user_listen_time{...}[1h]))

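This would show you, for any time range you choose to display on your graph, how fast session time was accruing, averaged over the previous hour. Because the counter is in milliseconds, the result is in milliseconds of session time per second; dividing it by 1000 gives (an approximation of) the average number of concurrent sessions over that hour.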
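As a rough sanity check on the units, the sketch below works through one made-up hour in Python. The session counts and helper name are assumptions for illustration only. It relies on the counter being in milliseconds, as described in the question, so the per-second rate comes out in ms of session time per second.

```python
MS_PER_SECOND = 1000


def average_concurrent_sessions(rate_ms_per_second):
    """Convert a rate in ms of session time per second into sessions.

    One session running continuously contributes 1000 ms of session
    time per second, so dividing by 1000 gives the average concurrency.
    """
    return rate_ms_per_second / MS_PER_SECOND


# Hypothetical hour: three sessions of 20 minutes each end in the window.
listen_ms = 3 * 20 * 60 * 1000   # total ms added to the counter
window_seconds = 60 * 60         # the [1h] window

rate = listen_ms / window_seconds           # ms of session time per second
sessions = average_concurrent_sessions(rate)
print(rate, sessions)
```

Three 20-minute sessions add up to one hour of session time within a one-hour window, which is the same as one session running the whole time. Since the counter only moves when a session ends, the real value lags behind actual usage, so treat it as an approximation.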