I want to implement file deduplication using Hadoop MapReduce. My plan is to compute the MD5 hash of each file in the input directory in my mapper function. These MD5 hashes would be the keys sent to the reducer, so files with the same hash would end up at the same reducer.
The default input format in Hadoop (TextInputFormat) gives the mapper the byte offset of each line within the file as the key and the content of that line as the value.
I have also read that if a file is large, it is split into chunks of 64 MB, which is the default block size in Hadoop.
How can I set the keys to be the file names, so that in my mapper I can compute the hash of the whole file? And how can I ensure that no two nodes compute the hash of the same file?
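To make the intended logic concrete, here is a minimal local sketch in plain Python (not Hadoop code; the helper names `map_phase` and `reduce_phase` are hypothetical) of the map/reduce behavior I am after: the map step emits (md5, filename) pairs, and the reduce step groups filenames by hash so that groups with more than one file are duplicates.

```python
import hashlib
import os
from collections import defaultdict

def map_phase(directory):
    """Map step: emit an (md5_hexdigest, filename) pair for every file."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        h = hashlib.md5()
        # Read in chunks so large files do not have to fit in memory.
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        yield h.hexdigest(), name

def reduce_phase(pairs):
    """Reduce step: group filenames by hash; groups of size > 1 are duplicates."""
    groups = defaultdict(list)
    for digest, name in pairs:
        groups[digest].append(name)
    return {d: names for d, names in groups.items() if len(names) > 1}
```

In Hadoop terms, `map_phase` is what each mapper should do with one whole file, and `reduce_phase` is what each reducer does with all filenames sharing a hash, which is why the file (or at least its name/content) needs to reach a single mapper intact rather than being split line by line.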