1
votes

I'm using big query and am trying to import custom dimensions along with noncustom dimensions. The analytics is sent from an app and basically I want a table with columns: UserID (custom dimension), platformID (custom dimension), ScreenName (basically app version of "Page name"), and date. The metric is "number of screenviews" grouped onto all of these dimensions. This is what it looks like below:

The photo of the GA report:

enter image description here

So, in bigquery, I could get numbers that checked out (when compared to GA report above) until I added in custom dimensions. Once I added custom dimensions, the numbers no longer made any sense.

I know that custom dimensions are nested within big query. So I made sure to use FLATTEN at first. Then I tried without flatten and got same results. The numbers make no sense (are hundreds of times larger than in GA interface).

My queries are below (one without FLATTEN and one with FLATTEN).

ps I ideally wanted to use

count(hits)

instead of

   count(hits.appInfo.screenName)

But I kept getting an error when I selected hits in my subquery.

My query without flatten is below. If you could help me figure out why is it that once I add custom dimensions all data gets messed up

  SELECT
  date,
  hits.appInfo.version,
  hits.appInfo.screenName,
  UserIdd,
  platform,
 count(hits.appInfo.screenName)


FROM (
          SELECT
          date,
          hits.appInfo.version,
          hits.appInfo.screenName,
          max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,

          max(case when hits.customdimensions.index = 20 then hits.customdimensions.value end) within record as platform



          FROM
            TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))

              )

  where UserIdd is not null
  and platform = 'Android'

GROUP BY
  1,
  2,
  3,
  4,
  5
ORDER BY
  6 DESC

and here is my query with FLATTEN (same issue - numbers dont make sense)

SELECT
date,
hits.appInfo.version,
  customDimensions.index,
  customDimensions.value,
  hits.appInfo.screenName,
   UserIdd,
 count(hits.appInfo.screenName)

FROM (FLATTEN(( FLATTEN((
          SELECT
          date,
          hits.appInfo.version,
            customDimensions.value,
            customDimensions.index,
            hits.appInfo.screenName,
          max(case when hits.customdimensions.index = 5 then hits.customdimensions.value end) within record as UserIdd,
            hits.type

          FROM
            TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04'))), customDimensions.value)),hits.type))

WHERE
  customDimensions.value = 'Android'
  and customDimensions.index = 20
  and UserIdd is not null


GROUP BY
  1,
  2,
  3,
  4,
  5,
  6
ORDER BY
  7 DESC
1
Why is there mysql tag on this question?Stijn de Witt

1 Answers

6
votes

I'm not positive that hits.customDimensions.* will always have the user-scoped dimensions (and I'm guessing your userId metric is user-scoped).

Specifically, user-scoped dimensions should be queried from customDimensions, not hits.customDimensions.

Notionally, the first step is to make customDimensions compatible with hits.* via flattening or scoped aggregation. I'll explain the flattening approach.

GA records have the shape (customDimensions[], hits[], ...), which is no good for querying both fields. We begin by flattening these to (customDimensionN, hits[], ...).

One level up, by selecting fields under hits.*, we implicitly flatten the table into (customDimensionN, hitN) records. We filter these to include only the records matching (customDimension5, appviewN).

The last step is to count everything up.

SELECT date, v, sn, uid, COUNT(*) 
FROM (
    SELECT
      date,
      hits.appInfo.version v,
      hits.appInfo.screenName sn,
      customDimensions.value uid
    FROM 
      FLATTEN((
          SELECT customDimensions.*, hits.*, date
          FROM
            TABLE_DATE_RANGE(
              [fiery-cabinet-97820:87025718.ga_sessions_], 
              TIMESTAMP('2017-04-04'),
              TIMESTAMP('2017-04-04'))),
      customDimensions)
WHERE hits.type = "APPVIEW" and customDimensions.index = 5)
GROUP BY 1,2,3,4
ORDER BY 5 DESC

Here's another equivalent approach. This uses the scoped aggregation trick that I've seen recommended in the GA BQ cookbook. Looking at the query explanation, however, the MAX(IF(...)) WITHIN RECORD seems to be quite expensive, triggering an extra COMPUTE and AGGREGATE phase in the first stage. Bonus points for being a bit more digestible, though.

SELECT sn, uid, date, v, COUNT(*) 
FROM (
    SELECT
      MAX(IF(customDimensions.index = 5, customDimensions.value, null)) within record as uid,
      hits.appInfo.screenname as sn,
      date,
      hits.appInfo.version as v,
      hits.type
    FROM 
      TABLE_DATE_RANGE([fiery-cabinet-97820:87025718.ga_sessions_], TIMESTAMP('2017-04-04'), TIMESTAMP('2017-04-04')))
WHERE hits.type = "APPVIEW" and uid is not null
GROUP BY 1,2,3,4
ORDER BY 5 DESC

I'm not yet familiar with the Standard SQL dialect of BQ, but it seems that it would simplify this kind of wrangling. You might want to wrap your head around that if you'll be making many queries like this.