3
votes

This query:

SELECT x 
FROM dataset.table_a 
WHERE x NOT IN (SELECT x FROM dataset.table_b)

returns zero records even though:

  • Field x in table_a contains 1,326,932 distinct string values

  • Field x in table_b contains 18,885 distinct string values

I do not understand why. Moreover, in BigQuery legacy SQL, this query returns the right answer.

2
Can you give an example of the data? Is it padded? If you change NOT IN to IN, what do you get for a result? – Jacob H
The migration guide has now been updated to document the difference between legacy and standard SQL. – Elliott Brossard

2 Answers

3
votes

I suspect I know the answer: it's due to legacy SQL's incorrect handling of NULL in relation to NOT IN, whereas standard SQL's behavior agrees with the SQL standard. There is an open documentation bug to add this to the migration guide, but it hasn't been resolved yet.

The documentation for IN (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#in-operators) states:

IN with a NULL in the IN-list can only return TRUE or NULL, never FALSE
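
To see the rule concretely, here is a minimal sketch with hypothetical literal values (not taken from the question's data); the third expression shows why a NULL in the list makes the surrounding WHERE clause drop every row:

#standardSQL
SELECT
  'a' NOT IN ('a') AS r1,        -- false
  'a' NOT IN ('b') AS r2,        -- true
  'a' NOT IN ('b', NULL) AS r3;  -- NULL, never TRUE, so a WHERE clause filters the row out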

You can achieve the desired behavior with this query, using NOT EXISTS instead:

SELECT x
FROM `dataset.table_a` AS t
WHERE NOT EXISTS (
  -- unlike NOT IN, NOT EXISTS is not affected by NULL values in table_b
  SELECT 1 FROM `dataset.table_b`
  WHERE t.x = x
);
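
As a quick sanity check, here is a self-contained repro with hypothetical inline data, assuming table_b contains at least one NULL (which is the likely cause in the question):

#standardSQL
WITH table_a AS (
  SELECT 'a' AS x UNION ALL SELECT 'b' UNION ALL SELECT 'c'
),
table_b AS (
  SELECT 'b' AS x UNION ALL SELECT CAST(NULL AS STRING)
)
SELECT x
FROM table_a AS t
WHERE NOT EXISTS (
  SELECT 1 FROM table_b WHERE t.x = x
);
-- returns 'a' and 'c', whereas the NOT IN form from the question returns zero rows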
3
votes

To minimize changes to your original query, you can just add WHERE NOT x IS NULL, as below:

#standardSQL
SELECT x
FROM `dataset.table_a`
-- excluding NULLs from the subquery lets NOT IN evaluate to TRUE
WHERE x NOT IN (SELECT x FROM `dataset.table_b` WHERE NOT x IS NULL)

Also, I recommend adding DISTINCT, as below, to optimize it a little:

#standardSQL
SELECT x
FROM `dataset.table_a`
WHERE x NOT IN (SELECT DISTINCT x FROM `dataset.table_b` WHERE NOT x IS NULL)