
So I have an output file (a .txt file) from a previous job in this format:

"   145
"Defects,"  1
"Information    1
"Plain  2
"Project    5
"Right  1
#51302] 1
$5,000) 1
&   3
'AS-IS',    1
(   1
("the   1

The left-hand side of each line is a word read from the document, and the number on the right-hand side is how many times I counted it. I want to create another MapReduce job, using Python and Hadoop Streaming, to find the top-k values (let's say k = 5 in this case). I am having trouble visualizing what the mapper is supposed to do.

Should I be parsing each line and appending each word and count to a list? Then, from those lists, would I take the top-k values and send them to the reducer? Then the reducer reads through all these lists and returns only the top-k values? If someone could offer some advice through pseudocode, or correct me if I am on the wrong path, it would be appreciated. Thanks!


1 Answer


You are almost on the right track. Treat the word as the key and the count as the value in your mapper. If your input file can contain multiple entries for the same word with different counts, you cannot take the top K directly from the mappers. You first have to aggregate the counts per word, and that has to happen in a reducer, since the reducer receives all the values for a given key and can sum them. You would then need a second, chained MapReduce job with a single reducer to pick the top K out of all the aggregated records.
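If you do need that aggregation step, a minimal sketch of the first job's reducer could look like the following. It assumes the mapper is a simple identity pass-through of the "word count" lines, and that Hadoop Streaming has already sorted them by word before they reach the reducer, so all counts for the same word arrive consecutively:

#!/usr/bin/env python
# Aggregation reducer (first job in the chain): sums all counts per word.
# Assumes input lines are "word<whitespace>count", sorted by word.
import sys

current_word = None
current_total = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit(None, 1)
    if word == current_word:
        current_total += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_total))
        current_word = word
        current_total = int(count)

# emit the last word
if current_word is not None:
    print('%s\t%d' % (current_word, current_total))

The output of this job then becomes the input to the second job, whose mapper and single reducer work like the top-K sketch below.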

But if your input file has only one entry per key, each mapper can emit just its local top K, and a single reducer can then pick the global top K from all of the mappers' output.
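For that single-entry case, here is a rough sketch of a mapper that keeps a local top K (K = 5 here) in a min-heap and emits only those candidates, followed by the single reducer that merges all the mappers' candidates. The parsing assumes each line is a word followed by whitespace and a count, as in your sample:

#!/usr/bin/env python
# Top-K mapper: keeps only the K highest counts seen by this mapper.
import sys
import heapq

K = 5
heap = []  # min-heap of (count, word), at most K entries

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit(None, 1)
    count = int(count)
    if len(heap) < K:
        heapq.heappush(heap, (count, word))
    else:
        # push the new pair and drop the smallest, keeping the K largest
        heapq.heappushpop(heap, (count, word))

# emit this mapper's local top K; all candidates go to the single reducer
for count, word in heap:
    print('%s\t%d' % (word, count))

#!/usr/bin/env python
# Top-K reducer: sees every mapper's local top K and keeps the K largest.
import sys
import heapq

K = 5
heap = []

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit(None, 1)
    heapq.heappush(heap, (int(count), word))
    if len(heap) > K:
        heapq.heappop(heap)

# print the global top K, largest count first
for count, word in sorted(heap, reverse=True):
    print('%s\t%d' % (word, count))

Run the streaming job with the number of reducers set to 1 (e.g. -numReduceTasks 1) so that a single reducer sees every mapper's candidates.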