Fill the table with data for missing date (postgresql, redshift)

Question

I'm trying to fill daily data for missing dates and can not find an answer, please help.

My daily_table example:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |

Expected result: I wand to fill this table with data for every domain and every day which just copy data from previous date:

      url          | timestamp_gmt | visitors | hits  | other.. 
-------------------+---------------+----------+-------+-------
 www.domain.com/1  | 2016-04-12    |   1231   | 23423 |
 www.domain.com/1  | 2016-04-13    |   1374   | 26482 |
 www.domain.com/1  | 2016-04-14    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-15    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-16    |   1374   | 26482 |     <-added
 www.domain.com/1  | 2016-04-17    |   1262   | 21493 |
 www.domain.com/2  | 2016-05-09    |   2345   | 35471 |

I can move a part of the logic into php, but it is undesirable, because my table has billions of missing dates.

SUMMARY:

During a few last days I foud out that:

Amazon-redshift works with 8-th version of PostgreSql, that's why it does not support such a beautiful command like JOIN LATERAL
Redshift also does not support generate_series and CTEs
But it supports simple WITH (thank you @systemjack) but WITH RECURSIVE does not

An obvious question: why? Wouldn't it make more sense to leave the gaps as they are, and let the web pages/whatever choose how to display this? — Tom Lord
This is a requirement, because our customers use tables directly, not via some interface. — D.Dimitrioglo
I don not know about CTE, i will read some documentation and answer later — D.Dimitrioglo

klin klin · Accepted Answer · 2016-06-19T11:09:08

Look at the idea behind the query:

select distinct on (domain, new_date) *
from (
    select new_date::date 
    from generate_series('2016-04-12', '2016-04-17', '1d'::interval) new_date
    ) s 
left join a_table t on date <= new_date
order by domain, new_date, date desc;

  new_date  |     domain      |    date    | visitors | hits  
------------+-----------------+------------+----------+-------
 2016-04-12 | www.domain1.com | 2016-04-12 |     1231 | 23423
 2016-04-13 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-14 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-15 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-16 | www.domain1.com | 2016-04-13 |     1374 | 26482
 2016-04-17 | www.domain1.com | 2016-04-17 |     1262 | 21493
(6 rows)

You'll have to choose start and end dates according to your requirements. The query may be quite expensive (you mentioned about billions gaps) so apply it with caution (test on a smaller data subset or execute by stages).

In the absence of generate_series() you can create your own generator. Here is an interesting example. Views from the cited article can be used instead of generate_series(). For example, if you need the period '2016-04-12' + 5 days:

select distinct on (domain, new_date) *
from (
    select '2016-04-12'::date+ n new_date
    from generator_16
    where n < 6
    ) s 
left join a_table t on date <= new_date
order by domain, new_date, date desc;

you'll get the same result like in the first example.

Fill the table with data for missing date (postgresql, redshift)

4 Answers