We use Grafana + Prometheus to monitor our infrastructure, and we recently added some business-focused metrics. I've been having issues with one of the counters we track, a session time counter. Each time a session ends, we increase that counter by the time the user spent in that session, so if a user spends 2 minutes using the software, the counter is incremented by 120000 ms. That approach worked perfectly fine for a few days. Yesterday, however, one instance's counter was much larger than the rest, and that large counter was reset when part of the service was restarted. Since then I can't get a meaningful singlestat panel anymore.
Here's a graph of what happened (this counter has 3 labels, which result in more than 50 label combinations):
The current all-time total tracked by this counter is 13.8 years over a 4-day period, but since the counter reset, my singlestat values for a 24h period have been either -20 years (using diff) or 35 years (using range). That is not wrong if you don't account for the counter reset, since diff and range only look at the min/max/first/current values, but it's no longer a useful metric.
If I set the timeframe so that it does not include the counter reset, both diff and range show values very close to what is expected (our usage is very linear and predictable).
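To make the behaviour above concrete, here is a minimal Python sketch of what I believe is happening. It assumes that diff means "current minus first" and range means "max minus min", which matches what I described. The sample values and the function names are made up for illustration and are not taken from our real data. The last function shows the kind of reset-aware total I'm after: any drop in the counter is treated as a restart from zero, which is similar in spirit to how Prometheus treats decreases in a counter.

```python
def diff(samples):
    """Current (last) value minus first value in the window."""
    return samples[-1] - samples[0]


def value_range(samples):
    """Maximum value minus minimum value in the window."""
    return max(samples) - min(samples)


def reset_aware_increase(samples):
    """Sum the increases between consecutive samples.

    A drop is assumed to be a counter reset to zero, so the new value
    itself is counted as the increase since the reset.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical counter samples in ms; the drop from 800 to 20 is the restart.
with_reset = [400, 600, 800, 20, 220]
# The same window cut off before the restart.
without_reset = [400, 600, 800]

print(diff(with_reset))                  # -180: negative, like my diff panel
print(value_range(with_reset))           # 780: inflated, like my range panel
print(reset_aware_increase(with_reset))  # 620: what I actually want to display

print(diff(without_reset))                  # 400
print(value_range(without_reset))           # 400
print(reset_aware_increase(without_reset))  # 400
```

Without the reset in the window all three agree, which is consistent with what I see when I move the timeframe.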
The singlestat panel query looks like this:
sum(dyno_app_music_total_user_listen_time{server=~"[[server]]", clusterId=~"[[clusterid]]"})
How can I handle resets in a counter for a singlestat metric?