Map Reduce Removing Duplicates

Question

I have been given a large text file and want to find the number of different words that start with each letter. I am trying to understand input and output values for map and reduce functions.

I understand a simpler problem which does not need to deal with duplicate words: determine the frequency with which each letter of the alphabet starts a word in the text using map reduce.

Map input: <0, “everyday i am city in tomorrow easy over school i iterate tomorrow city community”>

Map output: [<e,1>,<i,1>,<a,1>,<c,1>,<i,1>,<t,1>,<e,1>,<o,1>,<s,1>,<i,1>,<i,1>,<t,1>,<c,1>,<c,1>]

Reduce input: <a,[1]>,<c,[1,1,1]>,<e,[1,1]>,<i,[1,1,1,1]>,<o,[1]>,<s,[1]>,<t,[1,1]>

Reduce output: [<a,1>,<c,3>,<e,2>,<i,4>,<o,1>,<s,1>,<t,2>]

For the above problem the words 'i' 'city' and 'tomorrow' appear more than once so my final output should be:

Reduce output: [<a,1>,<c,2>,<e,2>,<i,3>,<o,1>,<s,1>,<t,1>]

I am unsure of how I would ensure duplicate words are remove in the above problem (would it be done in a pre processing phase or could it be implemented on either map or reduce functions). If I could get help understanding the map and reduce outputs of the new problem I would appreciate it.

Yaroslav Fyodorov Yaroslav Fyodorov · Accepted Answer · 2020-12-14T10:57:26

You can do it in two map-reduce passes:

find all the distinct word by using word as a map output and in reduce outputting each word once
you already solved - find frequency of each initial letter on each unique word.

Alternatively, since there are not many unique words you can cache them in mapper and output each one (or its first letter) only once and the reduce will be identical to your simpler problem. Actually, no, that won't work because same words can appear in different mappers. But you can still cache the words in mapper in the first solution and output each word only once per mapper - a little bit less traffic between map and reduce.

Map Reduce Removing Duplicates

2 Answers