0
votes

I'm having an issue calculating unique users in our GA BigQuery export. I've reproduced the same error using the sample data.

SELECT sum(users) as users, sum(sessions) as sessions FROM (
  SELECT
    h.page.pagePath as page_path,
    trafficSource.source,
    trafficSource.medium,
    COUNT(DISTINCT(fullVisitorId)) AS users,
    COUNT(*) as sessions
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
  WHERE h.page.pagePath = "/home"
  GROUP BY page_path, source, medium
)
UNION ALL
SELECT sum(users) as users, sum(sessions) as sessions FROM (
  SELECT
    h.page.pagePath as page_path,
    COUNT(DISTINCT(fullVisitorId)) AS users,
    COUNT(*) as sessions
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
  WHERE h.page.pagePath = "/home"
  GROUP BY page_path
)

When I include the source and medium columns, the distinct fullVisitorId count is 10 higher than without them. How does including these columns cause an increased number of fullVisitorIds? This doesn't make sense to me.

What's causing this and how would I get an accurate count?

1

1 Answers

1
votes

How does including these columns cause an increased number of fullVisitorIds? This doesn't make sense to me.

You can see why if you run your inner query like this:

SELECT
    MAX(fullVisitorId) AS fullVisitorId,
    h.page.pagePath as page_path,
    trafficSource.source,
    trafficSource.medium,
    COUNT(DISTINCT(TRIM(fullVisitorId))) AS users,
    COUNT(*) as sessions
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
  WHERE h.page.pagePath = "/home"
  and fullVisitorId = '9902321252073939460'
  GROUP BY page_path, source, medium

Which return this results:

enter image description here

As you can see because a user is coming from 2 different source/medium you are counting the same user twice which cause the increase.

One option to solve this is to use aggregate function on source/medium and remove them from the GROUP BY like this:


    SELECT sum(users) as users, sum(sessions) as sessions FROM (
      SELECT
        h.page.pagePath as page_path,
        MAX(trafficSource.source) as source,
        MAX(trafficSource.medium) as medium,
        COUNT(DISTINCT(TRIM(fullVisitorId))) AS users,
        COUNT(*) as sessions
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
      WHERE h.page.pagePath = "/home"
      GROUP BY page_path
    )
    UNION ALL
    SELECT sum(users) as users, sum(sessions) as sessions FROM (
      SELECT
        h.page.pagePath as page_path,
        COUNT(DISTINCT(TRIM(fullVisitorId))) AS users,
        COUNT(*) as sessions
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`, UNNEST(hits) h
      WHERE h.page.pagePath = "/home"
      GROUP BY page_path
    )

Now the number of users is the same:

enter image description here