0
votes

I have a dataset in the form of a userID × itemID ratings matrix that I am trying to convert to {userID, itemID, rating} triples for use with the Mahout item-based recommender, as described here: https://mahout.apache.org/users/recommender/userbased-5-minutes.html#dataset.

In other words, I want to convert something like this:

    1   2   3
1   1.0 2.0 3.0
2   4.0 5.0 6.0
3   7.0 8.0 9.0

Into something like this:

1,1,1.0
1,2,2.0
1,3,3.0
2,1,4.0
2,2,5.0
2,3,6.0
3,1,7.0
3,2,8.0
3,3,9.0

Is there a way to accomplish this using Apache Hadoop tools (Pig, Hive, etc.)?


1 Answer

0
votes

You can use explode (in Hive).

If your input table looks like this:

userID item1 item2 item3
----------------------
1      1.0   2.0   3.0
2      4.0   5.0   6.0
3      7.0   8.0   9.0

Then your query can be:

SELECT userID,
       split(item_val, '_')[0] AS item,
       split(item_val, '_')[1] AS val
FROM (SELECT userID,
             array(concat_ws('_', 'item1', item1),
                   concat_ws('_', 'item2', item2),
                   concat_ws('_', 'item3', item3)) AS arr
      FROM in_table) a
LATERAL VIEW explode(arr) exp AS item_val;

Explanation: The inner query generates this output:

userID                 arr
-----------------------------------------
1      (item1_1.0   item2_2.0   item3_3.0)
2      (item1_4.0   item2_5.0   item3_6.0)
3      (item1_7.0   item2_8.0   item3_9.0)

After the explode, each row holds the userID plus one combined "itemID_value" string; the outer query only needs to split that string to separate the itemID from the value.

Also, if the table's item columns are defined as DOUBLE, you need to CAST(item2 AS STRING) (and likewise for the other columns) before passing them to concat_ws, since concat_ws only accepts strings.
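If the data fits on one machine, the same wide-to-long conversion is also easy to script outside Hive. Here is a minimal Python sketch; the row layout, function name, and item IDs are assumptions chosen to match the example matrix above:

```python
# Hypothetical helper: "melt" a userID-by-itemID ratings matrix
# into {userID, itemID, rating} triples, one triple per cell.
def melt_ratings(matrix_rows, item_ids):
    """matrix_rows: iterable of (userID, rating1, rating2, ...) tuples.
    item_ids: the itemIDs corresponding to the rating columns, in order."""
    triples = []
    for row in matrix_rows:
        user_id, ratings = row[0], row[1:]
        for item_id, rating in zip(item_ids, ratings):
            triples.append((user_id, item_id, rating))
    return triples

# The example matrix from the question.
rows = [
    (1, 1.0, 2.0, 3.0),
    (2, 4.0, 5.0, 6.0),
    (3, 7.0, 8.0, 9.0),
]

# Print in the comma-separated form Mahout expects.
for user_id, item_id, rating in melt_ratings(rows, item_ids=[1, 2, 3]):
    print(f"{user_id},{item_id},{rating}")
```

This does in memory what the Hive query does in bulk: pair each rating with its column's itemID, then emit one row per pair.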