0
votes

Suppose I have two tables: intervals contains index intervals (its columns are i_min and i_max) and values contains indexed values (with columns i and x). Here's an example:

 values:        intervals:
+---+---+   +-------+-------+
| i | x |   | i_min | i_max |
+-------+   +---------------+
| 1 | 1 |   |   1   |   4   |
| 2 | 0 |   |   6   |   6   | 
| 3 | 4 |   |   6   |   6   | 
| 4 | 9 |   |   6   |   6   |
| 6 | 7 |   |   7   |   9   |
| 7 | 2 |   |  12   |  17   |
| 8 | 2 |   +-------+-------+
| 9 | 2 |
+---+---+

I want to sum the values of x for each interval:

       result:
+-------+-------+-----+
| i_min | i_max | sum | 
+---------------------+ 
|   1   |   4   |  13 | // 1+0+4+9
|   6   |   6   |   7 | 
|   6   |   6   |   7 | 
|   6   |   6   |   7 | 
|   7   |   9   |   6 | // 2+2+2
|  12   |  17   |   0 |
+-------+-------+-----+

In some SQL engines, this could be done using:

SELECT
  i_min,
  i_max,
  (SELECT SUM(x)
   FROM values 
   WHERE i BETWEEN intervals.i_min AND intervals.i_max) AS sum_x
FROM
  intervals

except that type of query is not allowed by BigQuery ("Subselect not allowed in SELECT clause." or "LEFT OUTER JOIN cannot be used without a condition that is an equality of fields from both sides of the join." depending on the syntax used).

There must be a way to do this with window functions, but I can't figure out how — all examples I've seen have the partition as part of the table. Is there an option that doesn't use CROSS JOIN? If not, what's the most efficient way to do this CROSS JOIN?

Some notes on my data:

  • Both tables contain many (10⁸-10⁹) rows.
  • There might be repetitions in intervals, not in i.
  • But two intervals in intervals are either the same, either entirely disjoint (no overlaps).
  • The union of all intervals is typically close to the set of all values of i (so it forms a partition of this space).
  • Intervals might be large (say, i_max-i_min < 10⁶).
1
Please edit your question and provide sample data and desired results. Also explain whether the intervals are overlapping, compact, sparse . . . that might affect the solution.Gordon Linoff
Make sure to enable standard SQL to use these kinds of features in your query. See also the migration guide.Elliott Brossard
@GordonLinoff good point, done.Ted
@ElliottBrossard also done (but I don't have a hard constraint on using one dialect or the other).Ted

1 Answers

3
votes

Try below - BigQuery Standard SQL

#standardSQL
SELECT
  i_min, i_max, SUM(x) AS  sum_x
FROM (
  SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM `project.dataset.intervals`
) AS intervals
JOIN (SELECT i, x FROM `project.dataset.values` UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL 
GROUP BY i_min, i_max, line
-- ORDER BY i_min

you can play/test with dummy data as below

#standardSQL
WITH intervals AS (
  SELECT  1 AS i_min, 4 AS i_max UNION ALL
  SELECT  6, 6 UNION ALL
  SELECT  6, 6 UNION ALL
  SELECT  6, 6 UNION ALL
  SELECT  7, 9 UNION ALL
  SELECT 12, 17 
),
values AS (
  SELECT 1 AS i, 1 AS x UNION ALL
  SELECT 2, 0 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 4, 9 UNION ALL
  SELECT 6, 7 UNION ALL
  SELECT 7, 2 UNION ALL
  SELECT 8, 2 UNION ALL
  SELECT 9, 2 
)
SELECT
  i_min, i_max, SUM(x) AS  sum_x
FROM (SELECT i_min, i_max, ROW_NUMBER() OVER() AS line FROM intervals) AS intervals
JOIN (SELECT i, x FROM values UNION ALL SELECT NULL, 0) AS values
ON values.i BETWEEN intervals.i_min AND intervals.i_max OR values.i IS NULL 
GROUP BY i_min, i_max, line
-- ORDER BY i_min