1
votes

I am analysing Google Analytics data in Google BigQuery using standard SQL.

I often run into the case of wanting to summarise hit level information into session level metrics. This involves trawling through unnested hits and filtering for just the relevant info.

For the example of booking process analysis you end up with a table of data at hit level, something along the lines of:

customer, bookingStep

I thus need to denormalise this into this sort of schema:

customer, step1, step2, etc...

Here's the sort of approach I'm currently using:

WITH normalised AS (
  # data created from trawling through hit level GA page views and selecting relevant rows
  SELECT 1 AS customer, 'step1' AS step UNION ALL 
  SELECT 1, 'step1' UNION ALL 
  SELECT 1, 'step2' UNION ALL 
  SELECT 1, 'step3' UNION ALL 
  SELECT 1, 'step4' UNION ALL 
  SELECT 1, 'step5' UNION ALL 
  SELECT 2, 'step1' UNION ALL 
  SELECT 2, 'step2' UNION ALL 
  SELECT 2, 'step3' UNION ALL 
  SELECT 2, 'step4' UNION ALL 
  SELECT 3, 'step1' UNION ALL 
  SELECT 3, 'step2' UNION ALL 
  SELECT 3, 'step3' UNION ALL 
  SELECT 3, 'step4' UNION ALL 
  SELECT 4, 'step1' UNION ALL 
  SELECT 4, 'step2' UNION ALL 
  SELECT 4, 'step3' UNION ALL 
  SELECT 4, 'step4' UNION ALL 
  SELECT 5, 'step1' UNION ALL 
  SELECT 5, 'step2' UNION ALL 
  SELECT 5, 'step3' UNION ALL 
  SELECT 6, 'step1' UNION ALL 
  SELECT 6, 'step2' UNION ALL 
  SELECT 7, 'step1'
)

The query:

/* aggregate to remove duplicate entries */
SELECT
  customer,
  CASE WHEN SUM(step1) > 0 THEN 1 ELSE 0 END AS step1,
  CASE WHEN SUM(step2) > 0 THEN 1 ELSE 0 END AS step2
  # for each step
FROM (
  /* denormalise into multiple fields */
  SELECT DISTINCT
    customer, 
    CASE WHEN step = 'step1' THEN 1 ELSE 0 END AS step1,
    CASE WHEN step = 'step2' THEN 1 ELSE 0 END AS step2
    # for each step
  FROM normalised
)
GROUP BY customer
ORDER BY customer ASC

Is there perhaps a better, more efficient way of doing this? My solution seems to work, but given the amount of code involved I can't help thinking there might be more succinct methods.

2

2 Answers

1
votes

I think you can do what you want in one step:

SELECT customer, 
       MAX(CASE WHEN step = 'step1' THEN 1 ELSE 0 END) AS step1,
       MAX(CASE WHEN step = 'step2' THEN 1 ELSE 0 END) AS step2
FROM normalised
GROUP BY customer;
0
votes

Below is a little less wordy version for BigQuery Standard SQL

#standardSQL
SELECT 
  customer, 
  SIGN(COUNTIF(step = 'step1')) AS step1,
  SIGN(COUNTIF(step = 'step2')) AS step2
FROM normalised
GROUP BY customer
ORDER BY customer ASC  

Meantime, for many practical usage you most likely need to apply more processing on top of this output which is not going to be as flexible as you would wish - because of hardcoded column names (assuming you actually have much more than 2 or 5 or just want to be more dynamic)

I would suggest considering "denormalising" using arrays as below - from my experience it gives you way more flexibility in further processing

#standardSQL
SELECT 
  customer, 
  ARRAY_AGG(DISTINCT step ORDER BY step) AS steps
FROM normalised
GROUP BY customer
ORDER BY customer ASC   

result is:

customer    steps    
--------    -----
1           step1    
            step2    
            step3    
            step4    
            step5    
2           step1    
            step2    
            step3    
            step4    
...         ....

Ror in many cases denormalizing into string also works

#standardSQL
SELECT customer, 
       STRING_AGG(DISTINCT step ORDER BY step) AS steps
FROM normalised
GROUP BY customer
ORDER BY customer ASC   

with output as simple as:

customer    steps    
--------    -----
1           step1,step2,step3,step4,step5    
2           step1,step2,step3,step4  
3           step1,step2,step3,step4  
4           step1,step2,step3,step4  
5           step1,step2,step3    
6           step1,step2  
7           step1