Querying time series in Postgress

Question

I am using Postgres to keep track of time series data.

---------
|ts|value|
|--------|
|1 |5    |
|--------|
|2 |2    |
|--------|
|5 |10   |
----------

The meaning of the value field is "number of events that happened since previous timestamp". For example in seconds 3-5 there were 10 events, as appears in the value of ts 5.

Our user can ask something like: "show me a line chart of seconds 0-6 which has data points in intervals of 4 seconds". So we will show two data points, for seconds 0 and 4, each one summarizing the number of events up to the next point. So the line chart will use this data:

---------
|ts|value|
|--------|
|0 |7    |
|--------|
|4 |10   |
----------

(since 5+2=7 and 10=10)

We use this query to get this data. The user can ask arbitrary start (time), end (time) and interval values. Also ts is of type integer (but can be adjusted to timestamp).

SELECT start+round((ts-start)/interval)*interval as ts1, sum(value)
FROM events
WHERE ts >= start AND ts <= end
GROUP BY ts1

Is this the right query, and more importantly the right approach?
I assume indexes on ts will be ignored, anything I can do about that?
Is there anyway to add linear interpolation logic to the query such that the 10 events that happened in seconds 3-5 (in original table) will not go in full to second 4 (of result set) since third of them actually happened in second 3 (of original data)?

you can use this sqlfiddle.

A complete solution that includes the proportioned events is not going to be pretty or efficient because GROUP BY cannot be used since, in general, each recorded interval can be part of an arbitrary number of desired intervals. This means you are need non-optimizing subselects. I may play with the problem later just as a challenge but its not going to be real practical unless the data set is small. — Dwayne Towell
Please check the spelling of your title: wiki.postgresql.org/wiki/Identity_Guidelines — Cœur

harmic harmic · Accepted Answer · 2014-02-24T06:32:13

One problem with the way you are currently doing it is that it does not generate a data point in any invervals which do not have any sample data. For example, if the user wants a chart from seconds 0 - 10 in steps of 1, then your chart won't have any points after 5. Maybe that doesn't matter in your use case though.

Another issue, as you indicated, it would be nice to be able to use some kind of linear interpolation to attribute the measurements in case the resolution of the requested plots is greater than the available data.

To solve the first of these, instead of selecting data purely from the sample table, we can join together the data with a generated series that matches the user's request. The latter can be generated using this:

SELECT int4range(rstart, rstart+1) AS srange 
FROM generate_series(0,10,1) AS seq(rstart)

The above query will generate a series of ranges, from 0 to 10 with a step size of 1. The output looks like this:

 srange
---------
 [0,1)
 [1,2)
 [2,3)
 [3,4)
 [4,5)
 [5,6)
 [6,7)
 [7,8)
 [8,9)
 [9,10)
 [10,11)
(11 rows)

We can join this to the data table, using the && operator (which filters on overlap).

The second point can be addressed by calculating the proportion of each data row which falls into each sample window.

Here is the full query:

SELECT lower(srange) AS t,
    sum (CASE 
        -- when data range is fully contained in sample range
        WHEN drange <@ srange THEN value
        -- when data range and sample range overlap, calculate the ratio of the intersection
        -- and use that to apportion the value
        ELSE CAST (value AS DOUBLE PRECISION) * (upper(drange*srange) - lower(drange*srange)) / (upper(drange)-lower(drange))
    END) AS value
FROM (
    -- Generate the range to be plotted (the sample ranges).
    -- To change the start / end of the range, change the 1st 2 arguments
    -- of the generate_series. To change the step size change BOTH the 3rd
    -- argument and the amount added to rstart (they must be equal).
    SELECT int4range(rstart, rstart+1) AS srange FROM generate_series(0,10,1) AS seq(rstart)
) AS s
LEFT JOIN (
    -- Note the use of the lag window function so that for each row, we get
    -- a range from the previous timestamp up to the current timestamp
    SELECT int4range(coalesce(lag(ts) OVER (order by ts), 0), ts) AS drange, value FROM data
) AS d ON srange && drange
GROUP BY lower(srange)
ORDER BY lower(srange)

Result:

 t  |      value
----+------------------
  0 |                5
  1 |                2
  2 | 3.33333333333333
  3 | 3.33333333333333
  4 | 3.33333333333333
  5 |
  6 |
  7 |
  8 |
  9 |
 10 |
(11 rows)

It is not likely any index will be used on ts in this query as it stands, and if the data table is large then performance is going to be dreadful.

There are some things you could try to help with this. One suggestion could be to redesign the data table such that the first column contains the time range of the data sample, rather than just the ending time, and then you could add a range index. You could then remove the windowing function from the second subquery, and hopefully the index can be used.

Read up on range types here.

Caveat Emptor: I have not tested this other than on the tiny data sample you supplied. I have used something similar to this for a somewhat similar purpose though.

Querying time series in Postgress

1 Answers