I have an output file (a .txt file) from a previous job in this format:
" 145
"Defects," 1
"Information 1
"Plain 2
"Project 5
"Right 1
#51302] 1
$5,000) 1
& 3
'AS-IS', 1
( 1
("the 1
The left-hand side of each line is a word I read from the document, and the number on the right-hand side is how many times I counted it. I want to write another MapReduce job, using Python and Hadoop Streaming, to find the top-k values (let's say k = 5 in this case). I am having trouble visualizing what the mapper is supposed to do.
Should I parse each line and append each word and count to a list? Then, from those lists, would I take the top-k values and send them to the reducer? Then the reducer reads through all these lists and returns only the overall top-k values? If someone could offer some advice through pseudocode, or correct me if I am on the wrong path, it would be appreciated. Thanks!
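
To make the question more concrete, here is a rough sketch of what I imagine the two scripts could look like. I'm assuming the input lines are whitespace-separated with the count as the last field, and that the job runs with a single reducer (I think -numReduceTasks 1 forces that, but I'm not sure). The file names and the "top" key are just my own placeholders. Please correct me if this is the wrong shape entirely.

    #!/usr/bin/env python
    # top_k_mapper.py -- minimal sketch (my own naming); assumes each input
    # line ends in a whitespace-separated integer count, with the word before it.
    import sys
    import heapq

    K = 5
    heap = []  # min-heap of (count, word), capped at K entries

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            # the word itself may contain spaces or quotes, so only split
            # off the last field as the count
            word, count_str = line.rsplit(None, 1)
            count = int(count_str)
        except ValueError:
            continue  # skip malformed lines
        if len(heap) < K:
            heapq.heappush(heap, (count, word))
        else:
            heapq.heappushpop(heap, (count, word))  # keep only the K largest

    # emit this mapper's local top-K under one constant key so every record
    # is routed to the same reducer
    for count, word in heap:
        print('top\t%d\t%s' % (count, word))

and then the reducer would merge the mappers' local top-k lists into a global top-k:

    #!/usr/bin/env python
    # top_k_reducer.py -- merges the local top-K lists into a global top-K
    import sys
    import heapq

    K = 5
    heap = []

    for line in sys.stdin:
        parts = line.strip().split('\t')
        if len(parts) != 3:
            continue
        _, count_str, word = parts
        count = int(count_str)
        if len(heap) < K:
            heapq.heappush(heap, (count, word))
        else:
            heapq.heappushpop(heap, (count, word))

    # print the winners, highest count first
    for count, word in sorted(heap, reverse=True):
        print('%s\t%d' % (word, count))

Is this roughly the right idea, or am I overcomplicating what the mapper should do?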