2
votes

I have a file containing a list of numbers separated by pipes (|), and it can contain duplicates. I need to write a MapReduce program that lists the numbers without duplicates, in the original input order. I am able to remove the duplicates, but the output doesn't retain the input order.

1
Why not use Hive, Pig, or Spark to do this? Each can do this in fewer than 10 lines of code. - OneCricketeer
Okay, then force the data to one reducer and sort it. - OneCricketeer
It's not clear whether you're being given input data that's already sorted, or whether you're being asked to preserve the exact order of the input data, just without duplicates. - OneCricketeer
Yes, I need to preserve the input order. Basically, the input data is not in any order. - suba
As of now my output is like (key=number, val=order_position). I think I need to make it key=order_position, val=number so that it preserves the input order. [In the mapper I am assigning an order_position sequence number to the numbers.] - suba

1 Answer

2
votes

It's very simple. Suppose your text is:

Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
Line 5 -> But his face you could not see,
Line 6 -> The Quangle Wangle sat,

Here, line 5 repeats line 3 and line 6 repeats line 2.

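The mapper should be similar to the one in the WordCount program. With TextInputFormat, each input record is keyed by the offset of its line in the file, so the input to the mapper is something like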

key-value pairs:

(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
(113, But his face you could not see,)
(146, The Quangle Wangle sat,)

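The mapper prefixes each line with its offset and an _ delimiter, and emits it under a NullWritable key. The output of the mapper: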

(NullWritable, 0_On the top of the Crumpetty Tree)
(NullWritable, 33_The Quangle Wangle sat,)
(NullWritable, 57_But his face you could not see,)
(NullWritable, 89_On account of his Beaver Hat.)
(NullWritable, 113_But his face you could not see,)
(NullWritable, 146_The Quangle Wangle sat,)
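
The tagging step can be sketched in plain Java, without any Hadoop dependency. This is only an illustration of the idea: OffsetTagger and tagLines are hypothetical names, and the sketch assumes UTF-8 text whose lines end in a single "\n". In a real job the offset would arrive as the mapper's input key rather than being computed by hand.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: mimics the values a mapper would
// emit if it were handed (byte offset, line) pairs for a text file.
class OffsetTagger {

    // Returns one "offset_line" string per input line, in file order.
    // Assumption: UTF-8 text with single-byte "\n" line terminators.
    static List<String> tagLines(String fileContents) {
        List<String> tagged = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            tagged.add(offset + "_" + line);
            // Advance past the bytes of this line plus its newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return tagged;
    }

    public static void main(String[] args) {
        String text = "The Quangle Wangle sat,\n"
                + "On account of his Beaver Hat.\n"
                + "The Quangle Wangle sat,\n";
        tagLines(text).forEach(System.out::println);
    }
}
```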

Now, make sure that you have only one reducer (for example, by calling job.setNumReduceTasks(1) in the driver), so that the single reducer receives every value:

Input to reducer

Key: NullWritable
Iterable<value>: [(0_On the top of the Crumpetty Tree), 
(33_The Quangle Wangle sat,), 
(57_But his face you could not see,), 
(89_On account of his Beaver Hat.), 
(113_But his face you could not see,), 
(146_The Quangle Wangle sat,)]

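Note that this relies on the values reaching the reducer in ascending offset order, which matches the original order because the line offsets from TextInputFormat are always ascending. MapReduce itself only sorts the reducer input by key and does not guarantee the order of the values within a key, so to be safe, sort the values by their numeric offset in the reducer before removing duplicates.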
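
If the values need to be put into ascending offset order explicitly, the comparison should use the numeric offset rather than the whole string, since a plain string sort would place 113_ before 33_. A minimal sketch in plain Java follows; OffsetOrder and its methods are hypothetical names, and it assumes every value has the offset_line shape shown above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders "offset_line" values by their numeric offset.
class OffsetOrder {

    // Parses the digits before the first '_' as the line's offset.
    // Assumption: the value is well formed, i.e. it starts with "<digits>_".
    static long offsetOf(String tagged) {
        return Long.parseLong(tagged.substring(0, tagged.indexOf('_')));
    }

    // Returns a new list sorted by ascending offset; the input is not modified.
    static List<String> sortByOffset(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingLong(OffsetOrder::offsetOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> shuffled = List.of("113_third", "0_first", "33_second");
        sortByOffset(shuffled).forEach(System.out::println);
    }
}
```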

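In the reducer, just iterate through the values, skip any line that has already been seen, and write each remaining line after removing the offset and the _ delimiter at the start. Split on the first _ only, so that underscores inside the line itself are preserved. The reducer output would be something like:

Reducer key-value

NullWritable, value.split("_", 2)[1]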
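
The duplicate-removal loop might look roughly like the following plain-Java sketch. DedupReducerSketch and dedupe are hypothetical names, and the method stands in for the body of a reduce call. It assumes the values already arrive in original order and each has the offset_line shape. It splits on the first _ only, so a line that itself contains underscores is kept whole.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the reducer logic: keeps the first occurrence of
// each line and drops later repeats.
class DedupReducerSketch {

    // Assumption: values are in original order and look like "offset_line".
    static List<String> dedupe(Iterable<String> values) {
        Set<String> seen = new HashSet<>();
        List<String> output = new ArrayList<>();
        for (String value : values) {
            // Limit of 2 strips only the offset prefix and its delimiter.
            String line = value.split("_", 2)[1];
            // Set.add returns false when the line was already present.
            if (seen.add(line)) {
                output.add(line);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> values = List.of(
                "0_The Quangle Wangle sat,",
                "24_On account of his Beaver Hat.",
                "54_The Quangle Wangle sat,");
        dedupe(values).forEach(System.out::println);
    }
}
```

The set of seen lines is held in memory, so this approach assumes the distinct lines fit in the single reducer's heap.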

Output from reducer

Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
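
The question's input is pipe-separated numbers rather than lines of text. As a local point of comparison (not a MapReduce job), the expected result for a single record can be computed with a LinkedHashSet, which keeps the first occurrence of each element in insertion order. PipeDedup and dedupe are hypothetical names, and the sketch assumes the whole record fits in one string.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical local check: removes duplicate tokens from one pipe-separated
// record while keeping the order in which they first appear.
class PipeDedup {

    static String dedupe(String record) {
        // "|" is a regex metacharacter, so it must be escaped for split.
        Set<String> unique = new LinkedHashSet<>(Arrays.asList(record.split("\\|")));
        return String.join("|", unique);
    }

    public static void main(String[] args) {
        System.out.println(dedupe("5|3|5|1|3"));
    }
}
```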