3
votes

This query:

SELECT x 
FROM dataset.table_a 
WHERE x NOT IN (SELECT x FROM dataset.table_b)

returns zero records even though:

  • Field x in table_a contains 1,326,932 distinct string values

  • Field x in table_b contains 18,885 distinct string values

I do not understand why. Moreover, in BigQuery legacy SQL, this query returns the right answer.

2
Can you give an example of the data? Is it padded? If you change NOT IN to IN, what do you get for a result? – Jacob H
The migration guide has now been updated to document the difference between legacy and standard SQL. – Elliott Brossard

2 Answers

3
votes

I suspect I know the answer: it's due to legacy SQL's incorrect handling of NULL in relation to NOT IN, whereas standard SQL's behavior agrees with the SQL standard. There is an open documentation bug to add this to the migration guide, but it hasn't been resolved yet.

The documentation for IN (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#in-operators) states:

IN with a NULL in the IN-list can only return TRUE or NULL, never FALSE
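
To see the rule concretely, here is a minimal sketch with hypothetical literal values (not taken from the question's data); the third expression shows why a NULL in the list makes the surrounding WHERE clause drop every row:

#standardSQL
SELECT
  'a' NOT IN ('a') AS r1,        -- false
  'a' NOT IN ('b') AS r2,        -- true
  'a' NOT IN ('b', NULL) AS r3;  -- NULL, never TRUE, so a WHERE clause filters the row out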

You can achieve the desired behavior with this query, using NOT EXISTS instead:

SELECT x
FROM `dataset.table_a` AS t
WHERE NOT EXISTS (
  -- unlike NOT IN, NOT EXISTS is not affected by NULL values in table_b
  SELECT 1 FROM `dataset.table_b`
  WHERE t.x = x
);
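
As a quick sanity check, here is a self-contained repro with hypothetical inline data, assuming table_b contains at least one NULL (which is the likely cause in the question):

#standardSQL
WITH table_a AS (
  SELECT 'a' AS x UNION ALL SELECT 'b' UNION ALL SELECT 'c'
),
table_b AS (
  SELECT 'b' AS x UNION ALL SELECT CAST(NULL AS STRING)
)
SELECT x
FROM table_a AS t
WHERE NOT EXISTS (
  SELECT 1 FROM table_b WHERE t.x = x
);
-- returns 'a' and 'c', whereas the NOT IN form from the question returns zero rows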
3
votes

To minimize changes to your original query, you can just add WHERE NOT x IS NULL, as below:

#standardSQL
SELECT x
FROM `dataset.table_a`
-- excluding NULLs from the subquery lets NOT IN evaluate to TRUE
WHERE x NOT IN (SELECT x FROM `dataset.table_b` WHERE NOT x IS NULL)

Also, I recommend adding DISTINCT, as below, to optimize it a little:

#standardSQL
SELECT x
FROM `dataset.table_a`
WHERE x NOT IN (SELECT DISTINCT x FROM `dataset.table_b` WHERE NOT x IS NULL)