I am analysing Google Analytics data in Google BigQuery using standard SQL.
I often run into the case of wanting to summarise hit level information into session level metrics. This involves trawling through unnested hits and filtering for just the relevant info.
For the example of booking process analysis you end up with a table of data at hit level, something along the lines of:
customer, bookingStep
I thus need to denormalise this into this sort of schema:
customer, step1, step2, etc...
Here's the sort of approach I'm currently using:
WITH normalised AS (
# data created from trawling through hit level GA page views and selecting relevant rows
SELECT 1 AS customer, 'step1' AS step UNION ALL
SELECT 1, 'step1' UNION ALL
SELECT 1, 'step2' UNION ALL
SELECT 1, 'step3' UNION ALL
SELECT 1, 'step4' UNION ALL
SELECT 1, 'step5' UNION ALL
SELECT 2, 'step1' UNION ALL
SELECT 2, 'step2' UNION ALL
SELECT 2, 'step3' UNION ALL
SELECT 2, 'step4' UNION ALL
SELECT 3, 'step1' UNION ALL
SELECT 3, 'step2' UNION ALL
SELECT 3, 'step3' UNION ALL
SELECT 3, 'step4' UNION ALL
SELECT 4, 'step1' UNION ALL
SELECT 4, 'step2' UNION ALL
SELECT 4, 'step3' UNION ALL
SELECT 4, 'step4' UNION ALL
SELECT 5, 'step1' UNION ALL
SELECT 5, 'step2' UNION ALL
SELECT 5, 'step3' UNION ALL
SELECT 6, 'step1' UNION ALL
SELECT 6, 'step2' UNION ALL
SELECT 7, 'step1'
)
The query:
/* aggregate to remove duplicate entries */
SELECT
customer,
CASE WHEN SUM(step1) > 0 THEN 1 ELSE 0 END AS step1,
CASE WHEN SUM(step2) > 0 THEN 1 ELSE 0 END AS step2
# for each step
FROM (
/* denormalise into multiple fields */
SELECT DISTINCT
customer,
CASE WHEN step = 'step1' THEN 1 ELSE 0 END AS step1,
CASE WHEN step = 'step2' THEN 1 ELSE 0 END AS step2
# for each step
FROM normalised
)
GROUP BY customer
ORDER BY customer ASC
Is there perhaps a better, more efficient way of doing this? My solution seems to work, but given the amount of code involved I can't help thinking there might be more succinct methods.