1
votes

Whats the problem

I have a very large list of results. I want to filter my query to only include a small sub-set of those based on the most popular results within the table.

When I put my filtering subquery in the WITH clause it doesn't work. But if I put it directly in the WHERE clause it does work. Why?

I'd prefer the first for clarity, but I can't get it to work.

I'm working in BigQuery with StandardSQL.

Example

WITH
  most_common AS (
  SELECT
    page
  FROM
    `mydataset`
  LIMIT 
    100
)
SELECT
  *
FROM
   `mydataset`
WHERE
  page IN most_common

Here I'm trying to get all the results, where the page is in the first 100 rows.

It returns the following error: - Syntax error: Expected "(" or keyword UNNEST but got identifier "most_common" at [12:12]

However if I put the subquery directly in the where it works fine.

SELECT
  *
FROM
   `mydataset`
WHERE
  page IN (
  SELECT
    page
  FROM
    `mydataset`
  LIMIT 
    100
)

My limited understanding

It says it wants unnest, but unnest turns an array to a table result as a sub-query this should already be a table.

2

2 Answers

2
votes
#standardSQL
WITH
  most_common AS (
  SELECT
    page
  FROM
    `mydataset`
  LIMIT 
    100
)
SELECT
  *
FROM
   `mydataset`
WHERE
  page IN (SELECT page FROM most_common)  

Another option:

#standardSQL
WITH most_common AS (
  SELECT ARRAY_AGG(page) pages FROM (
    SELECT page
    FROM `mydataset`
    LIMIT 100
  )
)
SELECT *
FROM `mydataset`, most_common
WHERE page IN UNNEST(pages)   

or slightly refactored version

#standardSQL
WITH most_common AS (
  SELECT ARRAY_AGG(page LIMIT 100) pages 
  FROM `mydataset`
)
SELECT *
FROM `mydataset`, most_common
WHERE page IN UNNEST(pages)  

Note: ARRAY_AGG(...) accepts ORDER BY clause so you can actually pick most common using this syntax ARRAY_AGG(page ORDER BY some metric DESC LIMIT 100)
Obviously final version depends on your real use case - but rather than this - you got the idea I hope

0
votes

When you write query with CTE expression data will be return in table format. You can use CTE expression in sub query to work.

WITH MYSELECT AS (SELECT
  status
FROM
  `bigquery-public-data.austin_311.311_request`
LIMIT 100)
SELECT COUNT(*) FROM `bigquery-public-data.austin_311.311_request` WHERE status IN (SELECT DISTINCT status FROM MYSELECT);