0
votes

InfluxDB version used: 1.8.0

Given a time series db that is used for storing e.g. temperatures from iot sensors (on different locations).
The sensors are queried e.g. every other minute.
Now the maximum temperature per sensor for the last hour can be queried using

   select max(*) from temperatures where time >= now() - 1h group by location 
   name: temperatures
   tags: location=collector
   time                 max_temperature
   ----                 ---------------
   2020-06-24T17:41:34Z 34.8
   
   name: temperatures
   tags: location=outside
   time                 max_temperature
   ----                 ---------------
   2020-06-24T17:43:34Z 23.4

I'm now would like to keep the max temperatures for every hour and for every day for a certain period of time.
So naturally I would use a retention policy and continuous queries.

Lets say I want to store the the maximum temperature by the hour for a month:

   show RETENTION POLICIES on iotsensors
   name       duration   shardGroupDuration replicaN default
   ----       --------   ------------------ -------- -------
   lastmonth  744h0m0s   24h0m0s            1        false

The continuous query looks like this:

   CREATE CONTINUOUS QUERY max_temperatures_per_hour ON iotsensors 
   BEGIN 
     SELECT max(temperature) INTO iotsensors.lastmonth.max_temperatures_per_hour FROM iotsensors.autogen.temperatures GROUP BY time(1h), location TZ('Europe/Berlin') 
   END

By the nature of the GROUP BY time(1h) term, the exact time of the temperature is lost.
Especially when the data is condensed for a whole day in the second step FROM iotsensors.lastmonth.max_temperatures_per_hour GROUP BY time(1d) the resolution is getting even more coarse. (setting it to midnight of each day 00:00:00)

   select max from iotmeasurements.last2years.max_temperatures_per_day where time >= now() - 4d group by location tz('Europe/Berlin')

   name: max_temperatures_per_day
   tags: location=collector
   time                      max
   ----                      ---
   2020-06-21T00:00:00+02:00 80.9
   2020-06-22T00:00:00+02:00 78.5
   2020-06-23T00:00:00+02:00 101.2

   name: min_max_temperatures_per_day
   tags: location=outside
   time                      max
   ----                      ---
   2020-06-21T00:00:00+02:00 21.8
   2020-06-22T00:00:00+02:00 22.5
   2020-06-23T00:00:00+02:00 22.8

I do know that this the expected and documented behaviour
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/#group-by-time-intervals

However, the information of when exactly the maximum value was recorded is a valuable information which I'd like to keep.

Is there any way to store the exact timestamp of the record when downsampling?
I'd prefer to keep the timestamp inside the time field like

   tags: location=collector
   time                 max
   ----                 ---
   2020-06-20T04:30:40Z 80.9
   2020-06-21T04:22:00Z 78.5
   2020-06-22T04:53:10Z 101.2

Alternatively but a second best solution would be to add a timestamp field for each downsampled record

   time                      max    timestamp
   ----                      ---    ---------
   2020-06-20T00:00:00+02:00 80.9   2020-06-20T04:30:40Z
   2020-06-21T00:00:00+02:00 78.5   2020-06-21T04:22:00Z
   2020-06-22T00:00:00+02:00 101.2  2020-06-22T04:53:10Z

For this I needed to be able to query the time into a separate field, wouldn't I.
But my attempts weren't successful so far. Something I tried was this:

SELECT max(temperature),time as timestamp FROM temperatures GROUP BY time(60m),"location"

I'd consider to move to InfluxDB 2.0 if that was a prerequesit for a solution to my problem.

1

1 Answers

0
votes

So far I haven't found a solution with using solely InfluxDB.
The original question was based on the misconception that there always is one single maximum value over the time frame used for downsampling. Given a series of data points like this.

   name: max_temperatures_per_day
   tags: location=collector
   time                      max
   ----                      ---
   2020-06-20T04:30:40Z 80.9
   2020-06-21T04:22:00Z 78.5
   2020-06-22T04:53:10Z 101.2
   2020-06-22T05:33:10Z 73.3
   2020-06-22T05:41:10Z 65.0
   2020-06-22T05:53:10Z 48.2
   2020-06-22T05:56:10Z 73.3
   2020-06-22T10:30:10Z 54.3
   2020-06-22T12:30:10Z 63.7
   2020-06-22T18:03:10Z 101.2
   2020-06-22T18:20:10Z 90.2

it would be possible to identify exactly one point in time having the maximum value with the 4th hour of the day 2020-06-22T04:53:10Z 101.2 but for the fifth hour it's not possible since the maximum value ocured at 5:33 as well at 5:56. Downsampling the data to the resolution of one day (24h) makes it even worse as the maximum value (101.2) ocured 4:53AM as well as 6:03PM that given day. Which of this possibly multiple points in time should be kept?

However using Kapacitor for carrying out the continues queries the original desired result can be achieved. Starting with from this article https://docs.influxdata.com/kapacitor/v1.5/guides/continuous_queries/, it's possible to setup a query like this


batch
  |query('SELECT * FROM "iotmeasurements"."autogen".temperatures')
    .period(1h)
    .every(1h)
    .groupBy('location')
    .align()
  |max('temperature')
    .as('max_temp')
    .usePointTimes()
  |influxDBOut()
    .database('iotmeasurements')
    .retentionPolicy('lastmonth')
    .measurement('max_temperatures')
    .precision('s')

This will keep the point time where the maximum value ocured first. In the example below, the data point at 5:33AM would be kept and the same value at 5:56AM would be skipped.
I'm not entirely sure if usePointTimes() (https://docs.influxdata.com/kapacitor/v1.5/nodes/influx_q_l_node/#usepointtimes) is needed.

In case loosing the record of later ocurances of the maximum value in the downsampling time frame is acceptable, this might be a solution. Even though, running a second service is needed for this. Adding an additional point of possible fail overs. Another disadvantage of using Kapacitor is that it seems to be not possible to perform a downsampling for the past.
One may carry out a GROUP BY time query like this SELECT max(temperature) INTO ... FROM temperatures WHERE time >= now() - 1w GROUP BY time(1h),"location" outside a continuous query to do the downsampling for measurement points from the past inside influxdb itself.
There seems to be now way for doing so for Kapacitor 'ticks'.