I am a newbie to the Hadoop MapReduce framework. I have read a lot of tutorials and I think I understand the framework, and I have successfully configured a Hadoop setup in pseudo-distributed mode. I have two specific tasks I need to accomplish with Hadoop MapReduce.
I have a very large number of data files in the following format:
Number of exchanged messages; user1; user2; time stamp;
An example would be: 5; John Doe; John Smith; 1/1/1900;
What I would like to accomplish is:
1. Mask the user names (for example, by replacing each user name with its SHA-256 hash) so that they are anonymous.
2. Aggregate the number of exchanged messages over a given period (say, 1 week).
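To make the first task concrete, the per-record transformation I have in mind looks roughly like the sketch below. It is plain Python using only the standard library, not Hadoop code, and the function names mask_user and mask_record are placeholders I made up for illustration. It assumes every record has exactly the four semicolon-separated fields shown above.

```python
import hashlib


def mask_user(name: str) -> str:
    """Return the SHA-256 hex digest of a user name (whitespace-trimmed, UTF-8)."""
    return hashlib.sha256(name.strip().encode("utf-8")).hexdigest()


def mask_record(line: str) -> str:
    """Mask both user names in one 'count; user1; user2; timestamp;' record.

    Assumption: the record has exactly four fields; anything else raises ValueError.
    """
    fields = [f.strip() for f in line.strip().rstrip(";").split(";")]
    count, user1, user2, timestamp = fields
    # The count and the time stamp pass through unchanged.
    return "; ".join([count, mask_user(user1), mask_user(user2), timestamp]) + ";"


if __name__ == "__main__":
    print(mask_record("5; John Doe; John Smith; 1/1/1900;"))
```

Each record is transformed independently of all the others, which is why I do not see anything to reduce here.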
Now to my questions. As far as I understand, the Hadoop MapReduce framework is intended for tasks like the second one. I can map each record to a key-value pair (key: the two user names between whom the messages were exchanged; value: the number of messages) and reduce by summing the values to get the total number of messages in a given period (say, 1 week).
But what about the first task? When I do data masking, there is no reduce operation. Is this task not something for Hadoop MapReduce? I want to run it in parallel, but I can't really see how to apply Hadoop MapReduce to the first task. The number of data files I need to process is really large, which makes me think of using Hadoop MapReduce anyway.
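For the second task, the map and reduce steps I have in mind can be sketched as follows. Again this is plain Python rather than Hadoop code, and the names map_record and reduce_counts are my own placeholders. It assumes the time stamp is month/day/year, uses the ISO calendar week as the "1 week" period, and treats (user1, user2) as an ordered pair, so messages recorded as (user2, user1) would be counted separately.

```python
from collections import defaultdict
from datetime import datetime


def map_record(line):
    """Emit ((user1, user2, week_key), count) for one input record."""
    fields = [f.strip() for f in line.strip().rstrip(";").split(";")]
    count, user1, user2, stamp = fields
    # Assumption: time stamps are month/day/year, e.g. 1/1/1900.
    day = datetime.strptime(stamp, "%m/%d/%Y")
    iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    week_key = f"{iso[0]}-W{iso[1]:02d}"
    return (user1, user2, week_key), int(count)


def reduce_counts(pairs):
    """Sum the counts of all pairs that share the same key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


if __name__ == "__main__":
    lines = [
        "5; John Doe; John Smith; 1/1/1900;",
        "3; John Doe; John Smith; 1/3/1900;",
        "2; John Doe; John Smith; 1/8/1900;",
    ]
    for key, total in reduce_counts(map_record(l) for l in lines).items():
        print(key, total)
```

Here the key carries the week as well as the two user names, so each reduce call only ever sees the records of one user pair in one week.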
Thanks for your comments!
P.S.: The question can be generalized to "For which types of tasks is Hadoop MapReduce best suited?"