I am trying to query Google BigQuery public Reddit Dataset. My goal is to compute the similarity of subreddits using Jaccards' Index , which is defined by:
My plan is to select the top N=1000 subreddits in terms of number of comments in August 2016. Then compute their cartesian product to get the combinations of all the subreddits in a subreddit1, subreddit2 shape.
Then use those rows of combinations to query the union of users between subreddit1 and subreddit 2 as well as the intersection.
The query I have so far is this:
SELECT
subreddit1,
subreddit2,
(SELECT
COUNT(DISTINCT author)
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit = subreddit1
OR subreddit = subreddit2
LIMIT 1
) as subreddits_union,
(
SELECT
COUNT(DISTINCT author)
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit = subreddit1
AND author IN (
SELECT author
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit= subreddit2
GROUP BY author
) as subreddits_intersection
FROM
(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2
FROM (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 1000
) a
CROSS JOIN (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 1000
) b
WHERE a.subreddit < b.subreddit
)
Which ideally would give the results:
subreddit1, subreddit2, subreddits_union, subreddits_interception
-----------------------------------------------------------------
Art | Politics | 50000 | 21000
Art | Science | 92320 | 15000
... | ... | ... | ...
However, this query gives me the following BigQuery error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Which I understand. However I don't think this query can be translated into an efficient join. Given that BQ does not have an apply method, is there any way this query could be set up without resorting to individual queries? Maybe with a PARTITION BY ?
