0
votes

I have tab-separated output from a Mahout recommender that I'd like to query in Hive. The recommendations look like this:

54508 [19:4.9,22:3.5]
54584 [17:5.2]
54648 [13:6.1,3:5.9]
54692 [17:8.1]
55424 [1:3.8]
55448 [16:2.7,3:1.2]
55452 [17:6.8]
57084 [42:6.8,3:5.4]
57212 [17:3.5]

There are two columns: the first column contains a userID, and the second contains a list of recommended products and their expected ratings.

I created a Hive table:

CREATE TABLE `recommendations_raw`(
  user int, 
  recommendations string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/etl/recommender/output';

And I'm able to transform the data into a long tabular form in a Hive query:

select
   user,
   product,
   rating
from recommendations_raw
lateral view explode(str_to_map(substr(recommendations, 2, length(recommendations) - 2), ",", ":")) product_rating as product, rating

user    product   rating
54508   19        4.9
54508   22        3.5
54584   17        5.2
[etc...]

However, I would have preferred to create the map inside the create table statement instead of using str_to_map inside the query since it seems wrong to create a table with string datatype when it's really a map.

Is this possible/practical? If so, how?

1

1 Answers

1
votes

Looks like in essence, you are using an EXTERNAL TABLE on a TEXT datafile produced by a non-Hive program (Mahout in this case).

If the file format was compatible with the way Hive serializes its MAP data type in TEXT (which is not the case because of the enclosing brackets), I guess you could just "map" a MAP column (excuse the pun) on that list of key:value. Google pointed me to that post for example.

But anyway, TEXT is TEXT. Hive has to deserialize the map on every read, whether implicitly (in the case of a MAP column definition) or explicitly (in case of a STRING column plus user-defined str_to_map()).

Bottom line: if your goal is simply to explode the list and feed another table with a "normalized" structure, as shown in your sample code, then your solution with str_to_map() is better because it is more versatile (can manage the brackets...!)