I have a file containing a list of numbers separated by the pipe character (|), and it can contain duplicates. I need to write a MapReduce program that lists the numbers without duplicates, in the original input order. I am able to remove the duplicates, but the output doesn't retain the input order.
2
votes
Why not use Hive, Pig, or Spark to do this? Each can do this in less than 10 lines of code
– OneCricketeer
Okay, then force the data to one reducer and sort it
– OneCricketeer
It's not clear if you're being given input data that's sorted already, or rather you're being asked to preserve the exact order of input data, just without duplicates
– OneCricketeer
Yes, I need to preserve the input order. Basically, the input data is not in any order.
– suba
As of now my output is (key=number, val=order_position). I think I need to make it key=order_position, val=number so that it preserves the input order. [In the mapper I am assigning an order_position sequence number to the numbers.]
– suba
1 Answer
2
votes
It's very simple. Suppose your text is:
Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
Line 5 -> But his face you could not see,
Line 6 -> The Quangle Wangle sat,
Here, lines 2 and 3 are repeated as lines 6 and 5, respectively.
The mapper should be similar to the one in the wordcount program, where the input to the mapper is key-value pairs like these:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
(113, But his face you could not see,)
(146, The Quangle Wangle sat,)
The output of the mapper would be:
(NullWritable, 0_On the top of the Crumpetty Tree)
(NullWritable, 33_The Quangle Wangle sat,)
(NullWritable, 57_But his face you could not see,)
(NullWritable, 89_On account of his Beaver Hat.)
(NullWritable, 113_But his face you could not see,)
(NullWritable, 146_The Quangle Wangle sat,)
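As a rough illustration of the tagging step above, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes so that it runs standalone; the class name OffsetTagger and the methods tag and tagAll are hypothetical names chosen for this example. In a real job the same string would be built inside map(LongWritable offset, Text line, ...) and written with a NullWritable key.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the map step; OffsetTagger is a hypothetical name.
class OffsetTagger {
    // Builds the mapper's output value: the byte offset, a '_' delimiter, then the line.
    static String tag(long offset, String line) {
        return offset + "_" + line;
    }

    // Tags each record with its offset (the offsets are supplied by the caller here;
    // in Hadoop they would come from TextInputFormat as the map input key).
    static List<String> tagAll(long[] offsets, String[] lines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            out.add(tag(offsets[i], lines[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 33, 57};
        String[] lines = {
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,"
        };
        for (String value : tagAll(offsets, lines)) {
            System.out.println(value);
        }
    }
}
```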
Now make sure that you have only one reducer, so that the single reducer receives every mapper output value:
Input to reducer
Key: NullWritable
Iterable<value>: [(0_On the top of the Crumpetty Tree),
(33_The Quangle Wangle sat,),
(57_But his face you could not see,),
(89_On account of his Beaver Hat.),
(113_But his face you could not see,),
(146_The Quangle Wangle sat,)]
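Note that Hadoop sorts the reducer input by key only. The values under the single NullWritable key are not guaranteed to arrive in offset order, particularly when there is more than one map task. To make the order reliable, either sort the values by their numeric offset inside the reducer, or emit the offset as a LongWritable key so that the shuffle sorts it for you. The offset that TextInputFormat assigns to each line increases with the line's position in the file, so ascending offset order is the original line order.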
In the reducer, just iterate through the list in offset order, keep only the first occurrence of each line, and write the line after removing the offset and the _ delimiter at the start. The reducer output would be something like:
Reducer key-value
NullWritable, value.split("_", 2)[1]
Output from reducer
Line 1 -> On the top of the Crumpetty Tree
Line 2 -> The Quangle Wangle sat,
Line 3 -> But his face you could not see,
Line 4 -> On account of his Beaver Hat.
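The reducer logic described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not a Hadoop Reducer: the class name OrderedDedup and its methods are hypothetical, and the values are passed in as a List<String> instead of an Iterable<Text>. The sketch sorts by the numeric offset itself instead of relying on arrival order, and splits only on the first _ so that lines containing underscores survive intact.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone sketch of the reduce step; OrderedDedup is a hypothetical name.
class OrderedDedup {
    // Parses the numeric offset that precedes the first '_'.
    static long offsetOf(String tagged) {
        return Long.parseLong(tagged.substring(0, tagged.indexOf('_')));
    }

    // Returns everything after the first '_', i.e. the original line.
    static String lineOf(String tagged) {
        return tagged.substring(tagged.indexOf('_') + 1);
    }

    // Sorts by numeric offset, then keeps only the first occurrence of each line.
    static List<String> dedupe(List<String> taggedValues) {
        List<String> sorted = new ArrayList<>(taggedValues);
        sorted.sort(Comparator.comparingLong(OrderedDedup::offsetOf));

        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String tagged : sorted) {
            String line = lineOf(tagged);
            if (seen.add(line)) { // add() returns false for a line already seen
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        values.add("113_But his face you could not see,");
        values.add("0_On the top of the Crumpetty Tree");
        values.add("57_But his face you could not see,");
        for (String line : dedupe(values)) {
            System.out.println(line);
        }
    }
}
```

The same approach should carry over to the pipe-separated numbers in the question if each number is tagged with its position in the input instead of a line offset. Because a single reducer sees every value, the set of seen lines has to fit in that reducer's memory.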