2
votes

I am trying to use a regex SerDe to create a hive table from a text file. I am starting easy and just want to parse each word in the text document into a row. Each row has one column and that is the word.

The regex I am using is ([a-zA-z]+)

And this is the later half of the create table command I am issuing to hive,

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([a-zA-Z]+)", "output.format.string" = "%1$s" ) STORED AS TEXTFILE;

Currently the table contains almost all NULL.

Any help would be awesome, Thanks!

1
You mentioned that the table contains almost all NULLs. Can you elaborate please? Are there some non-NULL columns? Can you please provide a small sample of your data?Mark Grover

1 Answers

0
votes

When deserializing the input, the RegExSerDe expects a full match of the input line against the provided regex. If it doesn't match, then all the columns in the row will be NULL.

To split all the words from your input, you'll need to take a different approach, first load the input into a table as is. Then do a whitespace split on the input and explode it so you get the individual words. You might want to do some additional matching and filtering to remove unwanted interpunctuation characters.

CREATE TABLE input (text STRING);
LOAD DATA LOCAL INPATH 'input.txt' INTO TABLE input;
SELECT word FROM input LATERAL VIEW explode(split(text, ' ')) words AS word;