
I am looking for a way to aggregate a BigQuery table with about 100 columns using the sum() function, but the following query is not allowed in BigQuery Standard SQL.

select sum(*)
from `dataset.Intermediate_Tables.eventCat_dummies`
group by Id

The reason I want to do this kind of aggregation in BigQuery is its ability to handle large amounts of data. I tried to do the same aggregation in a Jupyter notebook, but it failed every time, probably because of the size of the data (a 7.3 GiB CSV file). The code I tried is as follows:

df_type = type_dummies.groupby('Id', sort=False).sum()

Can anyone suggest how I can aggregate this large dataset, or offer alternatives?
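For what it's worth, one common way to keep a pandas aggregation of a too-large CSV within memory is to read it in chunks, aggregate each chunk, and then re-aggregate the partial results. A sketch (the helper name, chunk size, and column names are illustrative, following the sample data below):

```python
import io

import pandas as pd


def chunked_groupby_sum(csv_source, key='Id', chunksize=1_000_000):
    """Group-by-sum a CSV that may not fit in memory, one chunk at a time."""
    partials = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        partials.append(chunk.groupby(key, sort=False).sum())
    # Summing the per-chunk partial sums is equivalent to one big groupby-sum
    return pd.concat(partials).groupby(level=0).sum()


# Small in-memory demo standing in for the large file
csv_data = "Id,col1,col2\n1,0,1\n2,1,1\n1,1,0\n2,1,1\n"
result = chunked_groupby_sum(io.StringIO(csv_data), chunksize=2)
```

This trades a single large in-memory table for many small ones, at the cost of a second (tiny) groupby over the partial results.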

UPDATE WITH SAMPLE INPUT AND OUTPUT

Input data

Id col1 col2 col3 col4
1  0    0    0    1
2  0    1    1    1
1  1    0    0    0
4  0    0    0    0
19 0    0    0    0
2  1    1    1    1

Desired output

Id col1_sum col2_sum col3_sum col4_sum
1  1        0        0        1
2  1        2        2        2
4  0        0        0        0
19 0        0        0        0

In my original dataset, there are 100 columns and 40 million rows.

A simplified example with input and expected output would help us to answer your question. – Mikhail Berlyant
@MikhailBerlyant Thanks for your comment. I will update it with a simplified example. – cloudexplorer

1 Answer


Below is an example for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 a, 2 b, 3 c UNION ALL
  SELECT 1, 4, 5, 6 UNION ALL
  SELECT 2, 7, 8, 9
)
SELECT id, 
  SUM((
    SELECT SUM(CAST(SPLIT(pair, ':')[SAFE_OFFSET(1)] AS INT64)) 
    FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''))) pair 
    WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] != '"id"'
  )) val
FROM `project.dataset.table` t
GROUP BY id    

with result as

Row id  val  
1   1   21   
2   2   24     

As you can see, the above has no dependency on the number of columns.
It also assumes all columns except the id column need to be summed. If you have more columns to exclude, you can adjust the WHERE clause accordingly.
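To make the trick in the query above concrete: TO_JSON_STRING(t) turns a whole row into a JSON string such as '{"id":1,"a":1,"b":2,"c":3}', and the query then strips the braces, splits on commas into '"col":value' pairs, filters out the id pair, and sums the values. A Python sketch of the same per-row logic (purely illustrative; the function name is made up):

```python
import json


def row_sum_except_id(row):
    """Mimic the query's per-row trick: serialize to JSON, split into
    '"col":value' pairs, drop the id pair, and sum the remaining values."""
    s = json.dumps(row, separators=(',', ':')).strip('{}')
    pairs = s.split(',')
    return sum(int(p.split(':')[1]) for p in pairs if p.split(':')[0] != '"id"')


total = row_sum_except_id({"id": 1, "a": 1, "b": 2, "c": 3})  # sums a + b + c
```

This only holds for flat rows of integer columns, as in the example; nested or string-valued columns would break the naive splitting, just as they would in the SQL version.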

Update based on provided details
So you want to sum each individual column separately (initially I read your question as if you wanted to sum all the columns' values together by id):

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, 1 a, 2 b, 3 c UNION ALL
  SELECT 1, 4, 5, 6 UNION ALL
  SELECT 2, 7, 8, 9
)
SELECT id, 
  REPLACE(SPLIT(pair, ':')[SAFE_OFFSET(0)], '"', '')  col,
  SUM(CAST(SPLIT(pair, ':')[SAFE_OFFSET(1)] AS INT64)) val
FROM (
  SELECT id, 
    ARRAY(
      SELECT pair
      FROM UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t1), r'[{}]', ''))) pair 
      WHERE SPLIT(pair, ':')[SAFE_OFFSET(0)] != '"id"'
    ) arr
  FROM `project.dataset.table` t1
) t2,
UNNEST(arr) pair
GROUP BY id, col
ORDER BY id, col

This will give you the result below:

Row id  col val  
1   1   a   5    
2   1   b   7    
3   1   c   9    
4   2   a   7    
5   2   b   8    
6   2   c   9    

This result is a flattened version of what you need and, in most practical use cases, is much more efficient than a pivoted one.
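For comparison, the same flattened shape is easy to produce in pandas once the data fits in memory: groupby-sum gives the wide (pivoted) form, and melt() unpivots it into the (id, col, val) rows the query above returns. A sketch using the same sample values as the query:

```python
import pandas as pd

# Same sample data as the WITH clause in the query above
df = pd.DataFrame({'id': [1, 1, 2],
                   'a': [1, 4, 7],
                   'b': [2, 5, 8],
                   'c': [3, 6, 9]})

# Wide form: one row per id, one column per summed column
summed = df.groupby('id', sort=False).sum().reset_index()

# Flattened form: one row per (id, col) pair, matching the query's output
flattened = (summed.melt(id_vars='id', var_name='col', value_name='val')
                   .sort_values(['id', 'col'])
                   .reset_index(drop=True))
```

The wide `summed` frame corresponds to the pivoted result the question asks for; `flattened` corresponds to the query's output.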

Still, if you want to pivot this result, see https://stackoverflow.com/a/35808045/5221944