2 votes

I am trying to use the min() value of a timestamp as a starting point, then group the data into 30-day intervals, so that the count of occurrences in each interval appears as a column for every unique value in the timestamp date range.

I have two tables that I am joining together to get a count. Table 1 (page_creation) has 2 columns labeled link and dt_crtd. Table 2 (page_visits) has 2 other columns labeled url and date. The tables are being joined on table1.link = table2.url.

After the join I get a table similar to this:

+-------------------+---------------------+
| url               | date                |
+-------------------+---------------------+
| www.google.com    | 2018-01-01 00:00:00 |
| www.google.com    | 2018-01-02 00:00:00 |
| www.google.com    | 2018-02-01 00:00:00 |
| www.google.com    | 2018-02-05 00:00:00 |
| www.google.com    | 2018-03-04 00:00:00 |
| www.facebook.com  | 2014-01-05 00:00:00 |
| www.facebook.com  | 2014-01-07 00:00:00 |
| www.facebook.com  | 2014-04-02 00:00:00 |
| www.facebook.com  | 2014-04-10 00:00:00 |
| www.facebook.com  | 2014-04-11 00:00:00 |
| www.facebook.com  | 2014-05-01 00:00:00 |
| www.twitter.com   | 2016-02-01 00:00:00 |
| www.twitter.com   | 2016-03-04 00:00:00 |
+-------------------+---------------------+

What I am trying to get is a result like this:

+-------------------+---------------------+------------+------------+------------+
| url               | MIN_Date            | Interval 1 | Interval 2 | Interval 3 |
+-------------------+---------------------+------------+------------+------------+
| www.google.com    | 2018-01-01 00:00:00 | 2          | 2          | 1          |
| www.facebook.com  | 2014-01-05 00:00:00 | 2          | 0          | 1          |
| www.twitter.com   | 2016-02-01 00:00:00 | 1          | 1          | 0          |
+-------------------+---------------------+------------+------------+------------+

So the 30-day intervals begin from the min(date), as shown in Interval 1, and are counted every 30 days after that.
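In other words, each visit lands in bucket number floor(days_since_min / 30). A minimal Python sketch of that bucketing rule, using the sample rows from the joined table above (the function name is illustrative):

```python
from datetime import date

# Visit dates per url, taken from the joined table in the question
visits = {
    "www.google.com":   [date(2018, 1, 1), date(2018, 1, 2), date(2018, 2, 1),
                         date(2018, 2, 5), date(2018, 3, 4)],
    "www.facebook.com": [date(2014, 1, 5), date(2014, 1, 7), date(2014, 4, 2),
                         date(2014, 4, 10), date(2014, 4, 11), date(2014, 5, 1)],
    "www.twitter.com":  [date(2016, 2, 1), date(2016, 3, 4)],
}

def interval_counts(dates, n_intervals=3):
    """Count dates per 30-day bucket, starting at min(dates)."""
    start = min(dates)
    counts = [0] * n_intervals
    for d in dates:
        idx = (d - start).days // 30   # bucket 0 = days 0-29, bucket 1 = days 30-59, ...
        if idx < n_intervals:
            counts[idx] += 1
    return counts
```

For example, `interval_counts(visits["www.google.com"])` returns `[2, 2, 1]`, matching the expected row for www.google.com.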

I've looked at other questions such as:

Group rows by 7 days interval starting from a certain date

MySQL query to select min datetime grouped by 30 day intervals

However, they did not seem to answer my specific problem.

I've also looked into PIVOT syntax, but noticed it is only supported by certain DBMSs.

Any help would be greatly appreciated.

Thank you.

3
are you looking to get a variable number of columns, depending on how many 30-day intervals are found? – Felipe Hoffa

3 Answers

2 votes

If I understood your question correctly, you want to count page visits in the 30-, 60-, and 90-day intervals after page creation. If that's the requirement, try the SQL below:

select a11.url
,sum(case when a12.date between a11.dt_crtd and a11.dt_crtd+30 then 1 else 0 end) Interval_1
,sum(case when a12.date between a11.dt_crtd+31 and a11.dt_crtd+60 then 1 else 0 end) Interval_2
,sum(case when a12.date between a11.dt_crtd+61 and a11.dt_crtd+90 then 1 else 0 end) Interval_3
from page_creation a11
join page_visits a12
  on a11.link = a12.url
group by a11.url
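As a sanity check, the same SUM(CASE ...) bucketing can be reproduced in SQLite through Python's sqlite3 module; since SQLite has no `date + integer` arithmetic, julianday() is used for the day difference instead. The page_visits rows here are seeded from the question's joined table (including the creation dates) so the counts match the expected output:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page_creation (link TEXT, dt_crtd TEXT);
CREATE TABLE page_visits  (url TEXT, date TEXT);
INSERT INTO page_creation VALUES
  ('www.google.com', '2018-01-01'),
  ('www.facebook.com', '2014-01-05'),
  ('www.twitter.com', '2016-02-01');
INSERT INTO page_visits VALUES
  ('www.google.com', '2018-01-01'), ('www.google.com', '2018-01-02'),
  ('www.google.com', '2018-02-01'), ('www.google.com', '2018-02-05'),
  ('www.google.com', '2018-03-04'),
  ('www.facebook.com', '2014-01-05'), ('www.facebook.com', '2014-01-07'),
  ('www.facebook.com', '2014-04-02'), ('www.facebook.com', '2014-04-10'),
  ('www.facebook.com', '2014-04-11'), ('www.facebook.com', '2014-05-01'),
  ('www.twitter.com', '2016-02-01'), ('www.twitter.com', '2016-03-04');
""")

# Bucket each visit by whole days elapsed since page creation
rows = con.execute("""
SELECT c.link,
       MIN(v.date),
       SUM(CASE WHEN julianday(v.date) - julianday(c.dt_crtd) BETWEEN  0 AND 29 THEN 1 ELSE 0 END),
       SUM(CASE WHEN julianday(v.date) - julianday(c.dt_crtd) BETWEEN 30 AND 59 THEN 1 ELSE 0 END),
       SUM(CASE WHEN julianday(v.date) - julianday(c.dt_crtd) BETWEEN 60 AND 89 THEN 1 ELSE 0 END)
FROM page_creation c
JOIN page_visits  v ON c.link = v.url
GROUP BY c.link
ORDER BY c.link
""").fetchall()
```

Running this yields (url, min date, Interval_1, Interval_2, Interval_3) per row, e.g. `('www.google.com', '2018-01-01', 2, 2, 1)`.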
1 vote

If you are using BigQuery, I would recommend:

  • countif() to count a boolean value
  • timestamp_add() to add intervals to timestamps

The exact boundaries are a bit vague, but I would go for:

select pc.url,
       countif(pv.date >= pc.dt_crtd and
               pv.date < timestamp_add(pc.dt_crtd, interval 30 day)
              ) as Interval_00_29,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 30 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 60 day)
              ) as Interval_30_59,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 60 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 90 day)
              ) as Interval_60_89
from page_creation pc join
     page_visits pv
     on pc.link = pv.url
group by pc.url
1 vote

The way I am reading your scenario, especially based on your example of "After the join I get a table similar to ...", is that you have two tables that you need to UNION, not to JOIN.

So, based on that reading, the example below is for BigQuery Standard SQL (project.dataset.page_creation and project.dataset.page_visits are here just to mimic your Table 1 and Table 2):

#standardSQL
WITH `project.dataset.page_creation` AS (
  SELECT 'www.google.com' link, TIMESTAMP '2018-01-01 00:00:00' dt_crtd UNION ALL
  SELECT 'www.facebook.com', '2014-01-05 00:00:00' UNION ALL
  SELECT 'www.twitter.com', '2016-02-01 00:00:00' 
), `project.dataset.page_visits` AS (
  SELECT 'www.google.com' url, TIMESTAMP '2018-01-02 00:00:00' dt UNION ALL
  SELECT 'www.google.com', '2018-02-01 00:00:00' UNION ALL
  SELECT 'www.google.com', '2018-02-05 00:00:00' UNION ALL
  SELECT 'www.google.com', '2018-03-04 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-01-07 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-02 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-10 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-04-11 00:00:00' UNION ALL
  SELECT 'www.facebook.com', '2014-05-01 00:00:00' UNION ALL
  SELECT 'www.twitter.com', '2016-03-04 00:00:00' 
), `After the join` AS (
  SELECT url, dt FROM `project.dataset.page_visits` UNION DISTINCT
  SELECT link, dt_crtd FROM `project.dataset.page_creation`
)
SELECT 
  url, min_date, 
  COUNTIF(dt BETWEEN min_date AND TIMESTAMP_ADD(min_date, INTERVAL 29 DAY)) Interval_1,
  COUNTIF(dt BETWEEN TIMESTAMP_ADD(min_date, INTERVAL 30 DAY) AND TIMESTAMP_ADD(min_date, INTERVAL 59 DAY)) Interval_2,
  COUNTIF(dt BETWEEN TIMESTAMP_ADD(min_date, INTERVAL 60 DAY) AND TIMESTAMP_ADD(min_date, INTERVAL 89 DAY)) Interval_3
FROM (
  SELECT url, dt, MIN(dt) OVER(PARTITION BY url ORDER BY dt) min_date
  FROM `After the join`
)
GROUP BY url, min_date

with the result:

Row url                 min_date                    Interval_1  Interval_2  Interval_3   
1   www.facebook.com    2014-01-05 00:00:00 UTC     2           0           1    
2   www.google.com      2018-01-01 00:00:00 UTC     2           2           1    
3   www.twitter.com     2016-02-01 00:00:00 UTC     1           1           0