8
votes

I use Postgres and i have a large number of rows with values and date per station. (Dates can be separated by several days.)

id      | value | idstation | udate
--------+-------+-----------+-----
1       |  5    | 12        | 1984-02-11 00:00:00
2       |  7    | 12        | 1984-02-17 00:00:00
3       |  8    | 12        | 1984-02-21 00:00:00
4       |  9    | 12        | 1984-02-23 00:00:00
5       |  4    | 12        | 1984-02-24 00:00:00
6       |  8    | 12        | 1984-02-28 00:00:00
7       |  9    | 14        | 1984-02-21 00:00:00
8       |  15   | 15        | 1984-02-21 00:00:00
9       |  14   | 18        | 1984-02-21 00:00:00
10      |  200  | 19        | 1984-02-21 00:00:00

Forgive what may be a silly question, but I'm not much of a database guru.

Is it possible to directly enter a SQL query that will calculate linear regression per station for each date, knowing that the regression must be calculate only with actual id date, previous id date and next id date ?

For example linear regression for id 2 must be calculate with value 7(actual),5(previous),8(next) for dates 1984-02-17 , 1984-02-11 and 1984-02-21

Edit : I have to use regr_intercept(value,udate) but i really don't know how to do this if i have to use only actual, previous and next value/date for each lines.

Edit2 : 3 rows added to idstation(12); id and dates numbers are changed

Hope you can help me, thank you !

4
Not clear; you want to use value as the dependent variable, udate as the independent variable, and partition/stratify by idstation? - joop
Yes something like that. Actually i have to use Regr_intercept(udate,value) for each date of each station but i need to calculate this linear regression with the actual, previous and next date. I saw there is a mistake in my example, the good one must be : For example linear regression for id 2 must be calculate with value 7(actual),5(previous),8(next) for dates 1984-02-23 , 1984-02-21 and 1984-02-26 - Leasye

4 Answers

12
votes

This is the combination of Joop's statistics and Denis's window functions:

WITH num AS (
        SELECT id, idstation
        , (udate - '1984-01-01'::date) as idate -- count in dayse since jan 1984
        , value AS value
        FROM thedata
        )
        -- id + the ids of the {prev,next} records
        --  within the same idstation group
, drag AS (
        SELECT id AS center
                , LAG(id) OVER www AS prev
                , LEAD(id) OVER www AS next
        FROM thedata
        WINDOW www AS (partition by idstation ORDER BY id)
        )
        -- junction CTE between ID and its three feeders
, tri AS (
                  SELECT center AS this, center AS that FROM drag
        UNION ALL SELECT center AS this , prev AS that FROM drag
        UNION ALL SELECT center AS this , next AS that FROM drag
        )
SELECT  t.this, n.idstation
        , regr_intercept(value,idate) AS intercept
        , regr_slope(value,idate) AS slope
        , regr_r2(value,idate) AS rsq
        , regr_avgx(value,idate) AS avgx
        , regr_avgy(value,idate) AS avgy
FROM num n
JOIN tri t ON t.that = n.id
GROUP BY t.this, n.idstation
        ;

Results:

INSERT 0 7
 this | idstation |     intercept     |       slope       |        rsq        |       avgx       |       avgy       
------+-----------+-------------------+-------------------+-------------------+------------------+------------------
    1 |        12 |               -46 |                 1 |                 1 |               52 |                6
    2 |        12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
    3 |        12 | -10.6666666666667 | 0.333333333333333 |                 1 |             54.5 |              7.5
    4 |        14 |                   |                   |                   |               51 |                9
    5 |        15 |                   |                   |                   |               51 |               15
    6 |        18 |                   |                   |                   |               51 |               14
    7 |        19 |                   |                   |                   |               51 |              200
(7 rows)

The clustering of the group-of-three can probably be done more elegantly using a rank() or row_number() function, which would also allow larger sliding windows to be used.

1
votes
DROP SCHEMA zzz CASCADE;
CREATE SCHEMA zzz ;
SET search_path=zzz;

CREATE TABLE  thedata
        ( id      INTEGER NOT NULL PRIMARY KEY
        , value INTEGER NOT NULL
        , idstation  INTEGER NOT NULL
        , udate DATE NOT NULL
        );
INSERT INTO thedata(id,value,idstation,udate) VALUES
 (1      ,5   ,12       ,'1984-02-21' )
,(2      ,7   ,12       ,'1984-02-23' )
,(3      ,8   ,12       ,'1984-02-26' )
,(4      ,9   ,14       ,'1984-02-21' )
,(5      ,15  ,15       ,'1984-02-21' )
,(6      ,14  ,18       ,'1984-02-21' )
,(7      ,200 ,19       ,'1984-02-21' )
        ;

WITH a AS (
        SELECT idstation
        , (udate - '1984-01-01'::date) as idate -- count in dayse since jan 1984
        , value AS value
        FROM thedata
        )
SELECT  idstation
        , regr_intercept(value,idate)  AS intercept
        , regr_slope(value,idate)  AS slope
        , regr_r2(value,idate)  AS rsq
        , regr_avgx(value,idate)  AS avgx
        , regr_avgy(value,idate)  AS avgy
FROM a
GROUP BY idstation
        ;

output:

 idstation |     intercept     |       slope       |        rsq        |       avgx       |       avgy       
-----------+-------------------+-------------------+-------------------+------------------+------------------
        15 |                   |                   |                   |               51 |               15
        14 |                   |                   |                   |               51 |                9
        19 |                   |                   |                   |               51 |              200
        12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
        18 |                   |                   |                   |               51 |               14
(5 rows)

Note: if you want a spline-like regression you should also use the lag() and lead() window functions, like in Denis's answer.

0
votes

If the average is ok for you you could use avg build in... Something like

SELECT avg("value") FROM "my_table" WHERE "idstation" = 3;

Should do. For more complicated things you will need to write some pl/SQL-function I'm afraid or check for a addon on PostgreSQL.

0
votes

Look into window functions. If I get your question correctly, lead() and lag() will likely give you precisely what you want. Example usage:

select idstation as idstation,
       id as curr_id,
       udate as curr_date,
       lag(id) over w as prev_id,
       lag(udate) over w as prev_date,
       lead(id) over w as next_id,
       lead(udate) over w as next_date
from dates
window w as (
partition by idstation order by udate, id
)
order by idstation, udate, id

http://www.postgresql.org/docs/current/static/tutorial-window.html