@Michael-Desa gives an excellent explanation.
I'd like to augment that answer with a solution to a pretty common metric our company is interested in: "What is the maximum "operation per second" value on a specific measurement field?".
I will use a real-life example from our company.
Scenario Background
We send a lot of data from an RDBMS to redis. When transferring that data, we keep track of 5 counters:
TipTrgUp
-> Updates by a business trigger (stored procedure)
TipTrgRm
-> Removes by a business trigger (stored procedure)
TipRprUp
-> Updates by an unattended auto-repair batch process
TipRprRm
-> Removes by an unattended auto-repair batch process
TipDmpUp
-> Updates by a bulk-dump process
We made a metrics collector that sends the current state of these counters to InfluxDB, with an interval of 1 second (configurable).
Grafana graph 1: low resolution, no true max ops
Here is the grafana query that is useful, but does not show the true max ops when zoomed out (we know it will go to around 500 ops on a normal business day, when no special dumps or maintenance is taking place - otherwise it goes into the thousands):
SELECT
non_negative_derivative(max(TipTrgUp),1s) AS "update/TipTrgUp"
,non_negative_derivative(max(TipTrgRm),1s) AS "remove/TipTrgRm"
,non_negative_derivative(max(TipRprUp),1s) AS "autorepair-up/TipRprUp"
,non_negative_derivative(max(TipRprRm),1s) AS "autorepair-rm/TipRprRm"
,non_negative_derivative(max(TipDmpUp),1s) AS "dump/TipDmpUp"
FROM "$rp"."redis_flux_-transid-d-s"
WHERE
host =~ /$server$/
AND $timeFilter
GROUP BY time($interval),* fill(null)
Sidenotes: $rp
is the name of the retention policy, templated in grafana. We use CQ's to downsample to retention policies with a larger duration. Also note the 1s
as a derivative parameter: it is needed, since the default is different when using GROUP BY. This can be easily overlooked in the InfluxDB documentation.
The graph, seen by 24 hours looks like this:
If we simply use a resolution of 1s (as suggested by @Michael-Desa), an enormous amount of data is transferred from influxdb to the client. It works reasonably well (about 10 seconds), but too slow for us.
Grafana graph 2: low and high resolution, true max ops, slow performance
We can however use subqueries to add the true maxops to this graph, which is a slight improvement. A lot less data is transferred to the client, but the InfluxDB server has to do a lot of number crunching. Series B (with maxops
prepended in the aliases):
SELECT
max(subTipTrgUp) AS maxopsTipTrgUp
,max(subTipTrgRm) AS maxopsTipTrgRm
,max(subTipRprUp) AS maxopsRprUp
,max(subTipRprRm) AS maxopsTipRprRm
,max(subTipDmpUp) AS maxopsTipDmpUp
FROM (
SELECT
non_negative_derivative(max(TipTrgUp),1s) AS subTipTrgUp
,non_negative_derivative(max(TipTrgRm),1s) AS subTipTrgRm
,non_negative_derivative(max(TipRprUp),1s) AS subTipRprUp
,non_negative_derivative(max(TipRprRm),1s) AS subTipRprRm
,non_negative_derivative(max(TipDmpUp),1s) AS subTipDmpUp
FROM "$rp"."redis_flux_-transid-d-s"
WHERE
host =~ /$server$/
AND $timeFilter
GROUP BY time(1s),* fill(null)
)
WHERE $timeFilter
GROUP BY time($interval),* fill(null)
Gives:
Grafana graph 3: low and high resolution, true max ops, high performance, pre-calculate by CQ
Our final solution to these kind of metrics (but only when we need a live view, the subquery approach works fine for ad-hoc graphs) is: use a Continuous Query to pre-calculate the true maxops. We generate CQ's like this:
CREATE CONTINUOUS QUERY "redis_flux_-transid-d-s.maxops.1s"
ON telegraf
BEGIN
SELECT
non_negative_derivative(max(TipTrgUp),1s) AS TipTrgUp
,non_negative_derivative(max(TipTrgRm),1s) AS TipTrgRm
,non_negative_derivative(max(TipRprUp),1s) AS TipRprUp
,non_negative_derivative(max(TipRprRm),1s) AS TipRprRm
,non_negative_derivative(max(TipDmpUp),1s) AS TipDmpUp
INTO telegraf.A."redis_flux_-transid-d-s.maxops"
FROM telegraf.A."redis_flux_-transid-d-s"
GROUP BY time(1s),*
END
From here on, it's trivial to use these maxops measurements in grafana. When downsampling to an RP with longer retention, we again use max()
as the selector function.
Series B (with .maxops
appended in the aliases)
SELECT
max(TipTrgUp) AS "update/TipTrgUp.maxops"
,max(TipTrgRm) AS "remove/TipTrgRm.maxops"
,max(TipRprUp) as "autorepair-up/TipRprUp.maxops"
,max(TipRprRm) as "autorepair-rm/TipRprRm.maxops"
,max(TipDmpUp) as "dump/TipDmpUp.maxops"
FROM "$rp"."redis_flux_-transid-d-s.maxops"
WHERE
host =~ /$server$/
AND $timeFilter
GROUP BY time($interval),* fill(null)
Gives:
When zoomed in to 1s precision, you can see that the graphs become identical:
Hope this helps, TW