1
votes

I am working on a large BigQuery data set that is auto populated from Google Analytics. For this work, I am trying to calculate an exit rate using the data. Additionally, for reference, I am looking closely at the BigQuery export schema that is provided at the following link which describes the BigQuery export schema in sufficient detail.

As noted in this post from Google regarding the exit rate, exit rate can be defined as "For all pageviews to the page, the exit rate is the percentage that were the last in the session." To calculate this, I surmise that I need, for every unique visit, to check each of the pages in the hits.page.pagePath column and, if the the regular expression which finds an error is preceded by another URL that does not indicate an error, then an exit after the error path can be counted as an exit.

This looks to be a very clear case of path analysis & I am not yet certain if BigQuery can handle it easily or efficiently. In general, the URL that I am interested in looking at which contains the exit uses:

REGEXP_MATCH(hits.page.pagePath, r'/[^/]+error\.aspx')

For example, I started out using this in the following context as a trial:

SELECT hits.page.pagePath AS Page_Path
FROM [XXXXXXXX.ga_sessions_20140711]
WHERE REGEXP_MATCH(hits.page.pagePath, r'/[^/]+error\.aspx') OR REGEXP_MATCH(hits.page.pagePath, r'/[^/]+genericerror\.aspx')

Any suggestions or examples that one can point me to which highlight how someone has used BigQuery successfully to calculate exit rate would be greatly appreciated.

FURTHER DETAILS ADDED (July 14, 2014):

The data set in question here is XXXXXXXX. Here are some further details that were added. The overall query that will be executed will create a table with the following output:

Date Page (FROM CASE - there are multiple cases each day, so each day will have 7 or 8 rows each for the various cases), metric 1, metric 2 (bounce), metric 3 (exit)

The query has the following overall specifications:

Description: Pageviews and unique pageviews for ^/XXX/[^/]+error.aspx OR ^/XXX/[^/]+genericerror.aspx

Metrics: Pageviews, Unique Pageviews, Sessions, Users, Bounce rate (bounces), Exit rate (exits)

Case statement value (Page dimension value): Homepage 2.0, Homepage 1.0, Inbound Search 2.0, Inbound Search 1.0, Outbound Search 2.0, Outbound Search 1.0, Review Itinerary 2.0, Review Itinerary 1.0, Traveler info 2.0, Traveler info 1.0, Seat selector 2.0, Seat selector 1.0, Payment info 2.0, Payment info 1.0, digital 2.0 other, digital 1.0 other

CASE STATEMENT:

Case when previous page = "^/XXX/[^/]+/default\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Homepage 2.0" when previous page ="^/web/[^/]+/default\.aspx" and landing page = "^/web/[^/]+/default\.aspx" then "Homepage 1.0" when previous page="^XXX/[^/]+/apps/booking?flight/(searchresult1|search(rt|ow|md))\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Inbound Search 2.0" when previous page="^web/[^/]+/apps/booking?flight/(searchresult1|search(rt|ow|md))\.aspx" and landing page="^/web/[^/]+/default\.aspx" then "Inbound Search 1.0" when previous page="^/XXX/[^/]+/apps/booking/flight/searchResult2\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Outbound Search 2.0" when previous page="^/web/[^/]+/apps/booking/flight/searchResult2\.aspx" and landing page="^/web/[^/]+/default\.aspx" then "Outbound Search 1.0" when previous page="^/XXX/[^/]+/apps/booking/flight/reviewRevenue\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Review Itinerary 2.0" when previous page="^/web/[^/]+/apps/booking/flight/reviewRevenue\.aspx" and landing page="^/web/[^/]+/default\.aspx" then "Review Itinerary 1.0"  when previous page="^/XXX/[^/]+/apps/booking/flight/traveler\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Traveler info 2.0" when previous page="^/web/[^/]+/apps/booking/flight/traveler\.aspx" and landing page="^/web/[^/]+/default\.aspx" then "Traveler info 1.0"  when previous page="^/XXX/[^/]+/apps/booking/flight/seatSelector\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Seat selector 2.0" when previous page="^/web/[^/]+/apps/booking/flight/seatSelector\.aspx" and landing page="^/web/[^/]+/default\.aspx" then "Seat selector 1.0" when previous page="^/XXX/[^/]+/apps/booking/flight/billingRevenue\.aspx" and landing page="^/XXX/[^/]+/default\.aspx" then "Payment info 2.0" when previous page="^/web/[^/]+/apps/booking/flight/billingRevenue\.aspx" and landing page="^/web/[^/]+/default\.aspx" then "Payment info 1.0" when landing page ="^/XXX/[^/]+/default\.aspx" then "digital 2.0 other" else "digital 1.0 other" end as Page

Dimensions: Date, Page, Device Category, Browser, Custom Variable (Value 04)

Filters: REGEX: page=^/XXX/[^/]+/error.aspx or ^/XXX/[^/]+/genericerror.aspx

Group by / Order by: Date, Page

The final query here must be a single query executed Ad Hoc for a specified number of days which produces the simple table noted above (which sums for the various values). The exit rate looks to be a manual calculation. To start, I used the following table to generate the URL's of the various hits:

SELECT date, CONCAT(fullVisitorId, STRING(visitId)) AS unique_visit_id, visitId, visitNumber, fullVisitorId, totals.pageviews, totals.bounces, 
hits.page.pagePath, hits.page.pageTitle, device.deviceCategory, device.browser, device.browserVersion, hits.customVariables.index,
hits.customVariables.customVarName, hits.customVariables.customVarValue, hits.time
FROM (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
WHERE hits.customVariables.index = 4
ORDER BY unique_visit_id DESC, hits.time ASC
LIMIT 1000;

I am having issues now using the lag function to look at the previous hits.page.pagePath that was provided to do the case, and then the calculation of the exit rate.

2
Hi Nathan! I've been trying to work this question out, but the query/sample data/expected results would help a lot!Felipe Hoffa
Thanks Felipe! Let me put more details up for the problem. This is part of the three different Stackoverflow posts that I have put up on different aspects of the problem - each causing different challenges.Nathaniel Payne
I just added the full query details. As noted, there are multiple aspects which are problematic and exit rate is just one of the issues. The CASE statement represents arguably an even bigger function with my issues with the LAG() function.Nathaniel Payne
Here is a related question that I also just asked on the lagged part of the query: stackoverflow.com/questions/24746161/…Nathaniel Payne
Hi Felipe. I wanted to follow up and see if you had any insights here. Am still looking for the best way to actually query for the exit rate in BigQuery?Nathaniel Payne

2 Answers

2
votes

I have put this together to give you an extra column that states if a page is the last hit in a session. Important to check for the hit type in this.

Add Column To State If Page Is Last Hit

SELECT 
  unique_visit_id,
  page,
  hit_number,
  hit_type,
  max_hit,
  IF(hit_number = max_hit, 'yes', 'no') as last_page
FROM (SELECT CONCAT(fullVisitorId, STRING(visitId)) AS unique_visit_id, hits.hitNumber AS hit_number, hits.type AS hit_type, hits.page.pagePath AS page, MAX(hit_number) OVER (PARTITION BY unique_visit_id) AS max_hit
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.type = 'PAGE'
GROUP BY unique_visit_id, hit_number, hit_type, page
ORDER BY unique_visit_id, hit_number)

Get Pageviews, Exits and Exit Rate

This will then give you your actual calculations

SELECT 
  page,
  COUNT(page) as pageviews,
  SUM(IF(hit_number = max_hit, 1, 0)) as exits,
  (SUM(IF(hit_number = max_hit, 1, 0))/COUNT(page)) * 100 AS exit_rate
FROM (SELECT CONCAT(fullVisitorId, STRING(visitId)) AS unique_visit_id, hits.hitNumber AS hit_number, hits.type AS hit_type, hits.page.pagePath AS page, MAX(hit_number) OVER (PARTITION BY unique_visit_id) AS max_hit
FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
WHERE hits.type = 'PAGE'
GROUP BY unique_visit_id, hit_number, hit_type, page
ORDER BY unique_visit_id, hit_number)
GROUP BY page
ORDER BY pageviews DESC
0
votes

2021 update

original question was TL;DR
My colleague came here to find a way to calculate exit rate from BigQuery, however @tfayyaz solution didn't work for him.
For those who will search for GA exit rate BigQuery solution, you can find working code below:

SELECT 
  page,
  COUNT(hit_number) AS hit_number,
  SUM(IF(hit_number = max_hit, 1, 0)) as exit_count,
  SUM(IF(hit_number = max_hit, 1, 0)) / COUNT(hit_number) AS ex_rate
FROM (
    SELECT CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_visit_id, 
        hits.page.pagePath AS page,
        hits.hitNumber AS hit_number, 
        MAX(hits.hitNumber) OVER(PARTITION BY CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS max_hit
    FROM `[YOUR_PROJECT].[YOUR_VIEW_ID].ga_sessions_[YOUR_DATE]`, UNNEST (hits) AS hits
    WHERE hits.type = 'PAGE')
GROUP BY page
ORDER BY hit_number DESC

The logic here is as follows:

SELECT 
  page, -- 8. get all the pages
  COUNT(hit_number) AS pageviews, -- 10. sum the pageviews (optional)
  SUM(IF(hit_number = max_hit, 1, 0)) as exit_count, -- 11. sum the exits (optional)
  SUM(IF(hit_number = max_hit, 1, 0)) / COUNT(hit_number) AS ex_rate -- 12. calculate exit rate
FROM ( -- 7. from this subquery

-- subquery

    SELECT CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_visit_id, -- 3. get unique user+session id
        hits.page.pagePath AS page, -- 4. get every page visited during these sessions
        hits.hitNumber AS hit_number, -- 5. get hit count for them
        MAX(hits.hitNumber) OVER(PARTITION BY CONCAT(fullVisitorId, CAST(visitId AS STRING))) AS max_hit -- 6. then append the last pageview number for every user+session id respectively
    FROM `[YOUR_PROJECT].[YOUR_VIEW_ID].ga_sessions_[YOUR_DATE]`, UNNEST (hits) AS hits -- 1. from your GA view data
    WHERE hits.type = 'PAGE') -- 2. for pageview hits only

--

GROUP BY page -- 9. group the results by them
ORDER BY hit_number DESC -- 13. sort the result like in GA Exit Pages report