Intersection on big amount of data without BigQuery

Question

I have a table (in Google BigQuery) showing URLs visited by people. Each person is represented by a 10-character id.

If a user visited a URL once, there is one row in the table. There are around 90M unique people (ids) and around 400K unique domains.

My goal is to get, for each domain, the number of unique people who visited it. The result will be shown in an interface where a user can select or deselect domains and see the total number of people selected (so the sum of unique ids who visited the domains they chose).

The thing is, some people may have visited multiple domains, so the total sum will be wrong. I have a version where I get the number of unique ids per domain, and then in the interface I add to the total when a domain is selected and subtract from it when a domain is deselected. Of course, this doesn't solve the problem of ids being counted twice.
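To make the double counting concrete, here is a minimal Python sketch. The domains, ids, and function names are made up for illustration and are not taken from the real table.

```python
# Hypothetical sample data: domain -> set of visitor ids (not real data).
visitors_by_domain = {
    "a.com": {"id1", "id2", "id3"},
    "b.com": {"id2", "id3", "id4"},
}

def naive_total(selected):
    """Sum of per-domain unique counts; a shared id is counted once per domain."""
    return sum(len(visitors_by_domain[d]) for d in selected)

def exact_total(selected):
    """Size of the union of the id sets; each id is counted exactly once."""
    ids = set()
    for d in selected:
        ids |= visitors_by_domain[d]
    return len(ids)

selected = ["a.com", "b.com"]
print(naive_total(selected))  # 6
print(exact_total(selected))  # 4
```

The naive total is too high by 2 because id2 and id3 visited both domains. Getting the exact total needs the id sets themselves (or something equivalent), not just the per-domain counts.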

The large number of domains makes it impossible to precompute every possible intersection. I also want to query BigQuery only once, for speed and cost reasons. I feel like there is no real solution that avoids querying BigQuery after each selection. Can anyone tell me if I missed something?

Thank you

Martin Weitzmann · Accepted Answer · 2019-12-17T11:11:33

I think you're looking for the ROLLUP functionality in GROUP BY: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#group-by-clause

Example:

WITH Sales AS (
  SELECT  1 AS day, 'abc' AS user UNION ALL
  SELECT  1, 'abc' UNION ALL
  SELECT  1, 'def' UNION ALL
  SELECT  2, 'abc' UNION ALL
  SELECT  3, 'abc' UNION ALL
  SELECT  3, 'def' UNION ALL
  SELECT  3, 'abc'
)
-- ROLLUP(day) returns one row per day plus a grand-total row where day is NULL.
SELECT
  day,
  COUNT(DISTINCT user) AS total
FROM Sales
GROUP BY ROLLUP(day);
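
For readers without BigQuery at hand, the result of this query can be reproduced with a small Python sketch. It assumes the same seven rows as the Sales CTE; the helper name rollup_distinct is made up for illustration. The key point is that the grand-total row is a distinct count over all rows, not a sum of the per-day counts.

```python
from collections import defaultdict

# Same rows as the Sales CTE: (day, user).
sales = [(1, "abc"), (1, "abc"), (1, "def"), (2, "abc"),
         (3, "abc"), (3, "def"), (3, "abc")]

def rollup_distinct(rows):
    """Emulate COUNT(DISTINCT user) ... GROUP BY ROLLUP(day).

    Returns a dict mapping day -> distinct user count, with the
    grand-total row stored under the key None (NULL in SQL).
    """
    per_day = defaultdict(set)
    everyone = set()
    for day, user in rows:
        per_day[day].add(user)
        everyone.add(user)
    result = {None: len(everyone)}  # grand total across all days
    for day in sorted(per_day):
        result[day] = len(per_day[day])
    return result

print(rollup_distinct(sales))
# {None: 2, 1: 2, 2: 1, 3: 2}
```

The per-day counts add up to 5, while the grand-total row reports 2, since only two distinct users appear in the data. Note that this total covers every row in the table, not an arbitrary subset of days.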
