BigQuery - Complex Correlated query

Question

I am trying to query Google BigQuery public Reddit Dataset. My goal is to compute the similarity of subreddits using Jaccards' Index , which is defined by:

My plan is to select the top N=1000 subreddits in terms of number of comments in August 2016. Then compute their cartesian product to get the combinations of all the subreddits in a subreddit1, subreddit2 shape.

Then use those rows of combinations to query the union of users between subreddit1 and subreddit 2 as well as the intersection.

The query I have so far is this:

SELECT 
  subreddit1,
  subreddit2,
  (SELECT 
    COUNT(DISTINCT author)
  FROM `fh-bigquery.reddit_comments.2016_08`
  WHERE subreddit = subreddit1
    OR subreddit = subreddit2
  LIMIT 1
  ) as subreddits_union,

  (
    SELECT 
      COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
    AND author IN ( 
       SELECT author 
       FROM `fh-bigquery.reddit_comments.2016_08`
       WHERE subreddit= subreddit2
       GROUP BY author 
      ) as subreddits_intersection

FROM

(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2
 FROM  (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 1000
   ) a
 CROSS JOIN (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 1000
   ) b
 WHERE a.subreddit < b.subreddit
  )

Which ideally would give the results:

subreddit1, subreddit2, subreddits_union, subreddits_interception
-----------------------------------------------------------------
   Art     |  Politics |      50000      |      21000
   Art     |  Science  |      92320      |      15000
   ...     |  ...      |      ...        |      ...

However, this query gives me the following BigQuery error: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

Which I understand. However I don't think this query can be translated into an efficient join. Given that BQ does not have an apply method, is there any way this query could be set up without resorting to individual queries? Maybe with a PARTITION BY ?

Alex227 Alex227 · Accepted Answer · 2016-10-04T20:29:08

Thanks for your answer. This one works pretty well in returning the subreddit union , however, how would you implement the intersection ?

Maybe something along the lines of

WITH top_most AS (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 20
),
authors AS (
  SELECT DISTINCT author, subreddit
  FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT 
count(DISTINCT a1.author),
subreddit1, subreddit2
FROM
(
  SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
  FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000
)
INNER JOIN authors a1 on a1.subreddit = subreddit1
INNER JOIN authors a2 on a2.subreddit = subreddit2
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2

BigQuery - Complex Correlated query

2 Answers