1
votes

I have a MapReduce job that uses only a mapper, because each mapper's output is guaranteed to have unique keys. When the job runs and produces output files such as part-m-00000, part-m-00001, ..., will they be sorted by key?

Or do I need to implement a reducer that does nothing but write the records to files such as part-r-00000, part-r-00001, ...? And would that guarantee that the output is sorted by key?

3 Answers

0
votes

If you want the keys sorted within each file, and also want every key in file i to be less than every key in file j whenever i is less than j, you need not only a reducer but also a partitioner. You might want to consider using something like Pig, where this is trivial. If you want to do it with MapReduce, use the field you are sorting on as your key and write a partitioner that sends each key to the correct reducer.

0
votes

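When your map function emits its keys, they go through the partition function, which assigns each key to a reducer, and the framework then sorts them by key during the shuffle before the reducer sees them. So by default the keys arrive at each reducer in sorted order, and you can use the identity reducer.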
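To make the partition-then-sort step concrete, here is a minimal plain-Java sketch that only simulates it, with no Hadoop dependency. The class name `ShuffleSketch` and its methods are made up for illustration, `String` keys stand in for Hadoop key types, and a `TreeSet` stands in for the framework's sort (which relies on the question's assumption that keys are unique). The partition rule mirrors the usual hash-modulo scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation of partitioning followed by a per-reducer sort.
class ShuffleSketch {
    // Hash-modulo rule: non-negative hash code modulo the reducer count.
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Assigns every key to a partition; each TreeSet keeps its keys sorted,
    // standing in for the sort that happens before the reducer runs.
    static List<TreeSet<String>> shuffle(List<String> mapOutputKeys, int numReducers) {
        List<TreeSet<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeSet<>());
        }
        for (String key : mapOutputKeys) {
            partitions.get(hashPartition(key, numReducers)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("d", "a", "c", "b", "f", "e");
        List<TreeSet<String>> parts = shuffle(keys, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("reducer " + i + " sees " + parts.get(i));
        }
    }
}
```

With these six keys and two reducers, reducer 0 sees [b, d, f] and reducer 1 sees [a, c, e]: each reducer's input is sorted, but the two outputs placed one after the other are not.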

0
votes

If you want to guarantee sorted order, you can simply use a single IdentityReducer.

If you want it to be more parallelizable, you can specify more reducers, but then by default the output will only be sorted within each file, not across files. That is, each file will be sorted, but the keys in part-r-00000 will not necessarily all come before the keys in part-r-00001. If you do want the output sorted across files, you can use a custom partitioner that partitions based on the sort order: reducer 0 gets all of the lowest keys, then reducer 1, and so on, until reducer N gets all of the highest keys.
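The core of such a partitioner is just a range lookup. Below is a rough sketch of that logic in plain Java, without the Hadoop classes. The class name `RangePartitionSketch` and the split points `"h"` and `"p"` are assumptions made up for illustration; in a real job the same lookup would sit inside the `getPartition` method of your `Partitioner` subclass, and the split points would have to be chosen from your actual key distribution.

```java
import java.util.Arrays;

// Hypothetical range-partitioning logic: lower keys go to lower-numbered reducers.
class RangePartitionSketch {
    // splitPoints must be sorted ascending. Keys below splitPoints[0] go to
    // reducer 0, keys in [splitPoints[i-1], splitPoints[i]) go to reducer i,
    // and keys at or above the last split point go to the last reducer.
    static int getPartition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "p"}; // assumed boundaries for 3 reducers
        for (String key : new String[] {"apple", "h", "mango", "zebra"}) {
            System.out.println(key + " -> reducer " + getPartition(key, splitPoints));
        }
    }
}
```

Two split points give three reducers. Since every key sent to reducer i is smaller than every key sent to reducer i+1, and each reducer's output is sorted on its own, reading the part-r files in numeric order yields one globally sorted sequence. Badly chosen split points will still sort correctly but can leave some reducers with far more data than others.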