I'm trying to understand handling of null values in Hive, particularly when it comes to checking for null values in a Python transform.
I have been reading this wiki page about transforms:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
Specifically I'm looking at this second paragraph:
By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT ....
But I guess I'm confused when it says "all NULL values", how does it determine if the value is NULL? Is it smart enough to take into account the "serialization.null.format" table property?
If I understand the docs correctly, I think the default behavior is something like this:
Value to be stored to disk => Value actually stored to disk => Value sent to Python via TRANSFORM
Note \N is not an escape sequence, but literal two-char string backslash capital-n
- (empty string) => (empty string) => "
\N" - (null value) => (empty string) => "
\N" - 'NULL' => 'NULL' => 'NULL'
- "
\N" => "\N" => "\N"
Any corrections to that understanding would be awesome. :)
I'm not dealing with the default situation though. I am working on a system where all the tables have had this table property set:
TBLPROPERTIES ('serialization.null.format'='NULL')
It's not at all clear to me from the docs how that will effect how these values get stored to disk and how they will be translated during TRANSFORM and sent to Python:
Value to be stored to disk => Value actually stored to disk => Value sent to Python via TRANSFORM
Again, \N is not an escape sequence, but literal two-char string backslash capital-n.
- (empty string) => (empty string) => ???
- (null value) => 'NULL' => ???
- 'NULL' => 'NULL' => ???
- "\N" => "\N" => "\N"
So my primary question is, can anyone provide some insight/clarity into that second scenario? I need to know how those values will show up in Python, so that I can correctly check to see if the value is null.