0
votes

I'm having a problem with count distinct + case when / with group by rollup in BigQuery. The result did not as I expected. Here's the code:

with tb1 as
(
select 'DN' as geography
       ,'1012658993824' as SKU
       ,1 as pageview

union all

select 'KR' as geography
       ,'1012658993824' as SKU
       ,7 as pageview

)
select geography
       ,count(distinct(case when  pageview between 1 and 5 then SKU end)) as PV_from_0_to_5
       ,count(distinct(case when  pageview between 6 and 10 then SKU end)) as PV_from_6_to_10
from tb1
group by rollup (1)

**Output:**
geography/ PV_from_0_to_5/ PV_from_6_to_10
NULL 1 1
DN 1 0
KR 0 1

**Expected Output**
geography/ PV_from_0_to_5/ PV_from_6_to_10
NULL 0 1
DN 1 0
KR 0 1

Explain: Obviously, I'm counting pageview of a single SKU across two location: DN & KR. The problem is when using group by rollup, I expect the pageview would be aggregated (1+7 = 8 pageviews) before count distinct the SKU. I cannot put a SUM within a count distinct, hence I don't know what else can do.

2

2 Answers

0
votes

I see nothing wrong with this GROUP BY ROLLUP.

You can use ARRAY_AGG() to see what's being counted:

with tb1 as
(
select 'DN' as geography
       ,'1012658993824' as SKU
       ,1 as pageview

union all

select 'KR' as geography
       ,'1012658993824' as SKU
       ,7 as pageview

)
select geography
       ,count(distinct(case when  pageview between 1 and 5 then SKU end)) as PV_from_0_to_5
       ,count(distinct(case when  pageview between 6 and 10 then SKU end)) as PV_from_6_to_10
       , ARRAY_AGG(case when  pageview between 1 and 5 then SKU end IGNORE NULLS) PV_from_0_to_5_agg
       , ARRAY_AGG(case when  pageview between 6 and 10 then SKU end IGNORE NULLS) PV_from_6_to_10_agg
from tb1
group by rollup (geography)

enter image description here

The sample query just happens to use the same id for two different rows.

0
votes

I also do not see anything wrong with the group by rollup as Felipe was saying. In order to provide more clarity I have done this query:

with tb1 as
(
select 'DN' as geography
       ,'1012658993824' as SKU
       ,1 as pageview

union all

select 'KR' as geography
       ,'1012658993825' as SKU
       ,7 as pageview

)
select geography
       ,sum(pageview) as addition
       ,count(pageview) as count
       ,count(distinct(case when  pageview between 1 and 5 then SKU end)) as PV_from_0_to_5
       ,count(distinct(case when  pageview between 6 and 10 then SKU end)) as PV_from_6_to_10
       , ARRAY_AGG(case when  pageview between 1 and 5 then SKU end IGNORE NULLS) PV_from_0_to_5_agg
       , ARRAY_AGG(case when  pageview between 6 and 10 then SKU end IGNORE NULLS) PV_from_6_to_10_agg

from tb1
group by rollup (geography)

You will see that you get addition 8, and count 2 for the first row. That means that actually you have both pageview 1 and pageview 7. Hence, the result you get is the expected.

If you want to get your requested result you need to use the sum of pageviews explicitly. That is:

with tb1 as
(
select 'DN' as geography
       ,'1012658993824' as SKU
       ,1 as pageview

union all

select 'KR' as geography
       ,'1012658993825' as SKU
       ,7 as pageview

)
select geography,sum(pageview) as addition
       ,CAST(sum(pageview)<5 AND sum(pageview)>0 AS INT64) as PV_from_0_to_5
       ,CAST(sum(pageview)>5 AND sum(pageview)<10 AS INT64) as PV_from_6_to_10

from tb1
group by rollup (geography)