1
votes

In BigQuery, there is the Google Analytics based query as is stated below and this works correctly.

#standard sql   
SELECT
      Date,
      SUM(totals.visits) AS Sessions,
      SUM(totals.transactions) AS Transactions
    FROM
      `[projectID].[DatasetID].ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20181217'
      AND '20181217'
      AND totals.visits > 0
    GROUP BY
      Date

In this query, I need to exclude all hits where within a hit...

  • ..GA custom dimension #23 (hit-scope) contains value 'editor' OR
  • ..GA custom dimension #6 (product-scope) matches regular expression value '^63.....$' OR
  • ..GA hits.page.pagePath matches regular expression value 'gebak|cake'

Note: it is not the intention to apply the 3 conditions as stated above on session-level (as in this screenshot) but on hit-level, since I'd like to reproduce numbers from another GA view than the view from which the data is loaded to BigQuery. In this other GA view the 3 conditions as are stated above are set as view filters.

The 'best' query thus far is the one below (based on the post of Martin Weitzmann below). However, the dataset is not filtered in this query (in other words, the conditions do not work).

SELECT Date, 
-- hits,
SUM(totals.transactions), 
SUM(totals.visits) 

FROM (

(
  SELECT date, totals,
    -- create own hits array
    ARRAY(
      SELECT AS STRUCT 
        hitnumber, 
        page,
        -- create own product array
        ARRAY(
          SELECT AS STRUCT productSku, productQuantity 
          FROM h.product AS p
          WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%') 
        ) AS product
      FROM t.hits as h
      WHERE 
        NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
        AND
        (SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
    ) AS hits
  FROM
    `[projectID].[DatasetID].ga_sessions_*` t
  WHERE 
  _TABLE_SUFFIX BETWEEN '20181217'
  AND '20181217'
  AND totals.visits > 0
  ))
  GROUP BY Date

Does anyone know how to achieve the desired output?

Thanks a lot in advance!

Note: the projectID and datasetID have been masked in both queries because of privacy concerns.

2
can't you add those condition in where clause using "REGEXP_EXTRACT" function? - MJK
@MJK, I don't think so, see for my explanation my comment in the post below. - Timo Rietveld

2 Answers

2
votes

Own arrays approach

You can create your own hits and product arrays by using sub-queries on the original and feeding their output back into array functions. In those subqueries you can filter out your hits and products:

#standardsql
SELECT
  date,
  hits
  --SUM(totals.visits) AS Sessions,
  --SUM(totals.transactions) AS Transactions
FROM
  (
  SELECT 
    date, totals,
    -- create own hits array
    ARRAY(
      SELECT AS STRUCT 
        hitnumber, 
        page,
        -- create own product array
        ARRAY(
          SELECT AS STRUCT productSku, productQuantity 
          FROM h.product AS p
          WHERE (SELECT COUNT(1)=0 FROM p.customDimensions WHERE index=6 AND value like '63%') 
        ) AS product
      FROM t.hits as h
      WHERE 
        NOT REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
        AND
        (SELECT COUNT(1)=0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
    ) AS hits
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_20161104` t
  )
--GROUP BY 1
LIMIT 100

I left this example in an ungrouped state, but you can easily adjust it by commenting out the hits and group accordingly ...

Segmentation approach

I think you just need the right sub-query in your WHERE statement:

#standardsql
SELECT
  date,
  SUM(totals.visits) AS Sessions,
  SUM(totals.transactions) AS Transactions
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*` t
WHERE
  (SELECT COUNT(1)=0 FROM t.hits h
    WHERE 
      (SELECT count(1)>0 FROM h.customDimensions WHERE index=23 AND value like '%editor%')
      OR
      (SELECT count(1)>0 from h.product p, p.customdimensions cd WHERE index=6 AND value like '63%')
      OR
      REGEXP_CONTAINS(page.pagePath,r'gebak|cake')
  )
GROUP BY date

Since all your groups are on session level, you don't need any flattening (resp. cross joins with arrays) on the main table, which is costly. In your outermost WHERE you enter the hits array with a subquery - it's like a for-each on rows. Here you can already count occasions of REGEXP_CONTAINS(page.pagePath,r'gebak|cake').

For the other cases, you write a subquery again to enter the respective array - in the first case, customDimensions within hits. This is like a nested for-each inside the other one (subquery in a subquery).

In the second case, I'm simply flattening - but within the subquery only: product with its customDimensions. So this is a one-time nested for-each as well because I was lazy and cross-joined. I could've written another Subquery instead of the cross-join, so basically a triple-nested for-each (subquery in a subquery in a subquery).

Since I'm counting cases that I want to exclude, my outer condition is COUNT(1)=0.

I could only test it with ga sample data .. so it's kind of untested. But I guess you get the idea.

1
votes

Just a quick example/idea on how to use WITH and REGEXP_EXTRACT on a public set

WITH CD6 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions6Sum
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
   UNNEST(hits) AS hits,
   UNNEST(hits.product) AS prod,
   UNNEST(prod.customDimensions) AS cd
   WHERE cd.index=6 
   AND NOT REGEXP_CONTAINS(cd.value,r'^63.....$')
   GROUP BY cd.value
),
CD23 AS (
SELECT cd.value, SUM(totals.visits) AS Sessions23Sum
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
   UNNEST(hits) AS hits,
   UNNEST(hits.product) AS prod,
   UNNEST(prod.customDimensions) AS cd
   WHERE cd.index=23 
   AND NOT REGEXP_CONTAINS(cd.value,r'editor')
   GROUP BY cd.value
)

select CD6.Sessions6Sum + CD23.Sessions23Sum from CD6, CD23

You can get more information on how to use REGEXP_EXTRACT in bigQuery official API page