3
votes

I'm trying to sum values for specific date range around current row. Since Bigquery do not suppport dated ranges in windows functions, I use self join, like this:

with test_data as (
   select 1 val1, 7 val2, 'ord001' id, timestamp('2019-01-01 04:00:00') dt_order
   union all
   select 2 val1, 14 val2, 'ord002' id, timestamp('2019-01-02 05:00:00') dt_order
   union all
   select 3 val1, 21 val2, 'ord003' id, timestamp('2019-01-03 06:00:00') dt_order
)

,revenue_coeff as (
select 
   td.id,
   td.val1 *
   (select sum(td1.val2) / sum(td1.val1)
    from test_data td1
    where td1.dt_order >= timestamp_sub(td.dt_order, interval 24 hour) and
       td1.dt_order < timestamp_add(td.dt_order, interval 6 minute)
    )
from test_data td
)

select * from revenue_coeff

This toy query works just fine. But when I'm trying to use a real BigQuery table, I get a 'LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join' message. How can I implement such a query in BQ? Thanks in advance!

2
You have a low rate. Important on SO - you can mark accepted answer by using the tick on the left of the posted answer, below the voting. See meta.stackexchange.com/questions/5234/… for why it is important! Also important to vote on answer. Vote up answers that are helpful. ... You can check about what to do when someone answers your question - stackoverflow.com/help/someone-answers. Following these simple rules you increase your own reputation score and at the same time you keep us motivated to answer your questions :o) please consider!Mikhail Berlyant

2 Answers

2
votes

Below is for BigQuery Standard SQL

I will first answer your question at the end of your post - but than will address your statement at the top of your post. So ...

I get a 'LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join' message. How can I implement such a query in BQ?

#standardSQL
WITH `project.dataset.test_data` AS (
   SELECT 1 val1,  7 val2, 'ord001' id, TIMESTAMP('2019-01-01 04:00:00') dt_order UNION ALL
   SELECT 1 val1, 14 val2, 'ord002' id, TIMESTAMP('2019-01-02 05:00:00') dt_order UNION ALL
   SELECT 1 val1, 21 val2, 'ord003' id, TIMESTAMP('2019-01-03 06:00:00') dt_order
), revenue_coeff AS (
  SELECT 
    td1.id, 
    td1.val1  * SUM(td2.val2) / SUM(td2.val1)
  FROM `project.dataset.test_data` td1
  CROSS JOIN `project.dataset.test_data` td2
  WHERE td2.dt_order >= TIMESTAMP_SUB(td1.dt_order, INTERVAL 24 HOUR) 
  AND   td2.dt_order <  TIMESTAMP_ADD(td1.dt_order, INTERVAL 6 MINUTE)
  GROUP BY td1.id, td1.val1  
)
SELECT * FROM revenue_coeff   

As you can see - instead of LEFT JOIN you can use CROSS JOIN with ON clause moved into WHERE clause

Since Bigquery do not suppport dated ranges in windows functions ...

Actually, it does support - see example

#standardSQL
WITH `project.dataset.test_data` AS (
   SELECT 1 val1,  7 val2, 'ord001' id, TIMESTAMP('2019-01-01 04:00:00') dt_order UNION ALL
   SELECT 1 val1, 14 val2, 'ord002' id, TIMESTAMP('2019-01-02 05:00:00') dt_order UNION ALL
   SELECT 1 val1, 21 val2, 'ord003' id, TIMESTAMP('2019-01-03 06:00:00') dt_order
), revenue_coeff AS (
  SELECT id, val1  * SUM(val2) OVER(win) / SUM(val1) OVER(win)
  FROM `project.dataset.test_data` td1
  WINDOW win AS (ORDER BY UNIX_SECONDS(dt_order) RANGE BETWEEN 86400 PRECEDING AND 359 FOLLOWING )
)
SELECT * FROM revenue_coeff   

As you can see - the trick is in "converting" timestamp data type into int using UNIX_SECONDS function

Obviously - I recommend you to go with this second option

0
votes

You can alternatively do a left outer join like:

select a.val1, a.id, 
   sum(if(b.dt_order >= timestamp_sub(a.dt_order, interval 24 hour) and b.dt_order <= timestamp_add(a.dt_order, interval 6 minute), b.val2, 0.0))
   /
   sum(if(b.dt_order >= timestamp_sub(a.dt_order, interval 24 hour) and b.dt_order <= timestamp_add(a.dt_order, interval 6 minute), b.val2, 0.0)) 
from test_data a
left join test_data b on 1=1
group by 1,2

However, you have to manage the zero division errors upstream or by adding a case statement within.