How "transform" affect stored to disk and read from disk values?

Question

I'm trying to understand handling of null values in Hive, particularly when it comes to checking for null values in a Python transform.

I have been reading this wiki page about transforms:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Specifically I'm looking at this second paragraph:

By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT ....

But I guess I'm confused when it says "all NULL values", how does it determine if the value is NULL? Is it smart enough to take into account the "serialization.null.format" table property?

If I understand the docs correctly, I think the default behavior is something like this:

Value to be stored to disk => Value actually stored to disk => Value sent to Python via TRANSFORM

Note \N is not an escape sequence, but literal two-char string backslash capital-n

(empty string) => (empty string) => "\N"
(null value) => (empty string) => "\N"
'NULL' => 'NULL' => 'NULL'
"\N" => "\N" => "\N"

Any corrections to that understanding would be awesome. :)

I'm not dealing with the default situation though. I am working on a system where all the tables have had this table property set:

TBLPROPERTIES ('serialization.null.format'='NULL')

It's not at all clear to me from the docs how that will effect how these values get stored to disk and how they will be translated during TRANSFORM and sent to Python:

Value to be stored to disk => Value actually stored to disk => Value sent to Python via TRANSFORM

Again, \N is not an escape sequence, but literal two-char string backslash capital-n.

(empty string) => (empty string) => ???
(null value) => 'NULL' => ???
'NULL' => 'NULL' => ???
"\N" => "\N" => "\N"

So my primary question is, can anyone provide some insight/clarity into that second scenario? I need to know how those values will show up in Python, so that I can correctly check to see if the value is null.

Serge Ballesta Serge Ballesta · Accepted Answer · 2014-07-24T09:27:28

What I understand from the doc is that Hive Transform do not use at all the "serialization.null.format" table property. It has its own null value replacement :

value in table | given to python script
---------------------------------------
empty string   | empty string
null value     | \N
NULL           | NULL
\N             | \N

So you will have to sanitize columns that could contain \N as their value, as you would have to do for any columns that could contain tabs. There is an explicit warning on referenced page :

Warning :It is your responsibility to sanitize any STRING columns prior to transformation. If your STRING column contains tabs, an identity transformer will not give you back what you started with! To help with this, see REGEXP_REPLACE and replace the tabs with some other character on their way into the TRANSFORM() call.

So for all non string column, all is fine : it cannot contain tab nor \N and the Python script will only have to considere that \N is the representation of a null value.

If you have String for which you are sure they contain neither tabs nor \N , fine too, it is the same of former case with an empty string being ... an empty string !

If a string column can contain tabs, but there is another character that cannot be present (say |), you replace in MAP clause tabs by | through regexp_replace, the python script will have to know that a | in this column is a tab, and you use the reverse regexp_replace in the REDUCE clause.

If a string column can contain \N, but there is another string that cannot be present, the former rule can be applied.

In the more general case, I think that the only bullet proof solution would be to convert null values to \N end encode all not null values in base64 (\N is hopefully not a valid base64 value ...) using something like if (A IS NULL, '\N', base64(binary(A))) and decoding with if (A = '\N', NULL, cast(unbase64(A) as STRING))

How "transform" affect stored to disk and read from disk values?

1 Answers