4
votes

Wanted to ask / discuss with you about null values behaviour in BigQuery.

I have noticed that filtering out real values in a NULLABLE column, will results filtering out both the value requested and NULL values.

Take this query for ex:

select * from
(select NULL as some_nullable_col, "name1" as name),
(select 4 as some_nullable_col, "name2" as name),
(select 1 as some_nullable_col, "name3" as name),
(select 7 as some_nullable_col, "name4" as name),
(select 3 as some_nullable_col, "name5" as name)
--WHERE some_nullable_col != 3

All results return as expected,

And then:

select * from
(select NULL as some_nullable_col, "name1" as name),
(select 4 as some_nullable_col, "name2" as name),
(select 1 as some_nullable_col, "name3" as name),
(select 7 as some_nullable_col, "name4" as name),
(select 3 as some_nullable_col, "name5" as name)
WHERE some_nullable_col != 3

will omit 2 columns. the value 3 and null.

I guess this happens because BigQuery won't index null values / won't scan null values on where clause for efficiency, but it also brings troubles:

Each time I filter on a nullable column, the filter will look like WHERE some_nullable_col != 3 OR some_nullable_col IS NULL

This is obviously less comfortable.

Just wanted to get an explanation / does BigQuery's roadmap offers a fix for this issue?

4

4 Answers

6
votes

This is standard behavior of NULL in SQL, and all SQL databases (Oracle, Microsoft SQL Server, PostgreSQL, MySQL etc) have exactly same behavior. If the IS NULL check is too tedious, alternative solution is to use IFNULL or COALESCE function to convert NULL into non-NULL, i.e.

select * from
(select NULL as some_nullable_col, "name1" as name),
(select 4 as some_nullable_col, "name2" as name),
(select 1 as some_nullable_col, "name3" as name),
(select 7 as some_nullable_col, "name4" as name),
(select 3 as some_nullable_col, "name5" as name)
WHERE ifnull(some_nullable_col,0) != 3
4
votes

Yes, you are right that NULL is not matched to the comparator like some_nullable_col != 3. Let me explain the reason why.

Google are using a key value store as an underlying data storage for BigQuery. Unlike traditional relational database, data are fragmented by the row and fields and stored into many different locations. If the data is NULL, BigQuery consider the data does not exist, hence nothing is written to the data storage. As such, that filed will never be matched with any comparators except for "IS NULL". This is by design and Google don't have any plans to change the way it works at the moment.

The workaround for this is to set the special value for those fields. For example, if the type of that field is string, then you can use the null string "" instead of NULL. If the field type is non-negative integer, you can use "-1" as a special value. I understand that this is not really optimal, and it could be better to add "IS NULL" statement in your query in many situation. This is just to give you another option.

By the way, I tried the similar thing on my MySQL instance, and the way it behaves is the same as BigQuery. Namely, the query does not return NULL records with "=!" comparator.

For example,

mysql> select * from test1;
+------+------------+
| id   | num        |
+------+------------+
|    0 | aaa        |
|    1 | bbb        |
|    8 | sdfsdfgsdf |
|    9 | NULL       |
| NULL | sdfsdfsfsf |
+------+------------+
5 rows in set (0.19 sec)

and

mysql> select * from test1 where id != 8;
+------+------+
| id   | num  |
+------+------+
|    0 | aaa  |
|    1 | bbb  |
|    9 | NULL |
+------+------+
3 rows in set (0.18 sec)

So I think this is a standard behavior in SQL's world.

1
votes

Just to add into the pile :o)

In some cases below option can be useful

SELECT * FROM
(SELECT NULL AS some_nullable_col, "name1" AS name),
(SELECT 4 AS some_nullable_col, "name2" AS name),
(SELECT 1 AS some_nullable_col, "name3" AS name),
(SELECT 7 AS some_nullable_col, "name4" AS name),
(SELECT 3 AS some_nullable_col, "name5" AS name)
WHERE IFNULL(some_nullable_col != 3, true)

Consider for example case when your nullable field is of string type
In this case you will need to do only one change here -

WHERE IFNULL(some_nullable_col != '3', true)

whereas if you use IFNULL directly on nullanble field like below

WHERE IFNULL(some_nullable_col, 0) != 3

you will need to reflect it not only for '3' but also for '0' so extra thing to take care of

Of course, at the end of the day, all is the same and it is just matter of preferences but some times actually depends on specific use and implementation pattern

As of being standard behavior - BigQuery so far is so far from following standards - and isn't this why we all so in love with it?!

0
votes

null is a special value. Many expressions with null return null itself, including a not equal predicate expression. This is a property of null and is by design. If you want to include null in your result you should explicitly allow it, for example with a IS NULL check,

Your query then becomes:

select * from
…
WHERE (some_nullable_col != 3 OR some_nullable_col IS NULL)

There are plenty resources about this online, for example on Wikipedia.