I want to implement file deduplication using Hadoop MapReduce. My plan is to compute the MD5 hash of each file in the input directory in my mapper function. These MD5 hashes would be the keys sent to the reducer, so files with the same hash would end up at the same reducer.
The default input format in Hadoop (TextInputFormat) gives the mapper the byte offset of each line within the file as the key and the content of that line as the value.
I have also read that if a file is large, it is split into chunks of 64 MB, which is the default block size in Hadoop.
How can I set the keys to be the file names, so that in my mapper I can compute the hash of the whole file? And how can I ensure that no two nodes compute the hash of the same file?
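To make the intended logic concrete, here is a minimal local sketch in plain Python (not Hadoop code; the helper names `map_phase` and `reduce_phase` are hypothetical) of the map/reduce behavior I am after: the map step emits (md5, filename) pairs, and the reduce step groups filenames by hash so that groups with more than one file are duplicates.

```python
import hashlib
import os
from collections import defaultdict

def map_phase(directory):
    """Map step: emit an (md5_hexdigest, filename) pair for every file."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        h = hashlib.md5()
        # Read in chunks so large files do not have to fit in memory.
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        yield h.hexdigest(), name

def reduce_phase(pairs):
    """Reduce step: group filenames by hash; groups of size > 1 are duplicates."""
    groups = defaultdict(list)
    for digest, name in pairs:
        groups[digest].append(name)
    return {d: names for d, names in groups.items() if len(names) > 1}
```

In Hadoop terms, `map_phase` is what each mapper should do with one whole file, and `reduce_phase` is what each reducer does with all filenames sharing a hash, which is why the file (or at least its name/content) needs to reach a single mapper intact rather than being split line by line.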