Incorrect and inconsistent user counts using Google Core Reporting API and bigquery when using variable/dimension filtering

Question

Background: I have app and web data, some of my apps (new iOS versions) use GA dimensions and the rest (Android and web) use GA custom variables.

So firstly, I'm currently trying to replicate this query in BigQuery in the Query Explorer to get simple user counts over a defined date for my web users only:

select count(distinct fullvisitorid, 10000000) as users
 from table_date_range([12345678.ga_sessions_],    
 timestamp('2015-02-01'), timestamp('2015-03-01')) 
  where hits.customvariables.customvarvalue like '%web%'

I get around 5.34m users. This corresponds to what I see in Google Analytics. I am confident this figure is correct.

If I go into the Query Explorer and apply no filters (so I include my app and web users) I get 5.70m users. Again, this corresponds to Google Analytics and we're confident this figure is correct, web makes up the majority of our traffic.

If I run another query in Query Explorer but this time apply the filter:

ga:customVarValue1=@web

I get 8.73m users. So I have more users after applying the filter than without... obviously this isn't correct and has something to do with how the Query Explorer is applying the filter post aggregation.

Note: When I run this query in BigQuery:

select sum(users)
from (
   select count(distinct fullvisitorid, 1000000) as users,
    hits.customvariables.customvarvalue as platform
   from table_date_range([12345678.ga_sessions_],    
    timestamp('2015-02-01'), timestamp('2015-03-01')) 
   group each by 2)
    where platform like '%web%'

I get 8.73m users. Almost the exact same number as I get when applying the filter in Query Explorer, the difference I get of around 1% can be explained by the sampling. I've tested it on multiple dates so I'm sure this is what's happening. Applying the filter post aggregation instead of pre (as in my first BigQuery query) leads to a higher number of users because we had two web releases in this timeframe. So all users are being counted once for every version of web they used.

To add: One of the developers on my team wrote some Python script back in February which replicated the first BigQuery code written above (a simple user count where the variable=web) but instead hits the Core Reporting API and requests an unsampled report. Until March 5th 2015 the number of users we got using BigQuery versus the Python script were almost identical (difference of 1% due to sampling). Then on March 5th they began to diverge, even for historical user counts, and instead our Python script started producing counts similar to the Query Explorer (filters being applied post aggregation instead of pre).

My question(s) are: 1. What changed on March 5th? 2. How do we replicate in Query explorer the first BigQuery code above? Are we applying the variable filter correctly? 3. How do we replicate the BigQuery code in our Python script which hits the Core reporting API?

Lastly: When in Query Explorer I ask for user counts over a given date and instead use a dimension filter:

ga:dimension2=@ios

I get around 50% LESS than I get in BigQuery when running:

select count(distinct fullvisitorid, 10000000) as users
 from table_date_range([12345678.ga_sessions_],    
 timestamp('2015-02-01'), timestamp('2015-03-01')) 
  where hits.customdimensions.value like '%ios%'

If the filter was being applied post aggregation as it is when filtering using variables then I would get a higher user count, not less. I seriously cannot explain what the Query Explorer is doing in order to give me substantially lower counts when filtering on dimensions.

Please halp

Tom Tom · Accepted Answer · 2015-05-08T03:28:08

I don't have an answer for you, but since you're using BigQuery, I assume you are a Premium customer? If so, you can open a ticket with their support team - they should get back to you quickly.

Incorrect and inconsistent user counts using Google Core Reporting API and bigquery when using variable/dimension filtering

1 Answers