2
votes

I am a newbie in the world of the Hadoop MapReduce framework. I have read a lot of tutorials on my own and understand the framework. I have successfully configured a Hadoop setup in pseudo-distributed mode. I have two specific tasks I need to accomplish in Hadoop MapReduce.

I have many, many data files in the following format.

Number of exchanged messages; user1; user2; time stamp;

An example would be: 5; John Doe; John Smith; 1/1/1900;

What I would like to accomplish is:

  1. Do data masking on the user names (e.g. hashing the usernames with SHA-256 so that they are anonymized).

  2. Aggregate the number of exchanged messages in a given period (say one week).

Now let us come to my questions. According to my current knowledge, the Hadoop MapReduce framework is well suited to the second task: I can map key-value pairs (the pair of user names between whom the messages were exchanged as the key, the number of messages as the value) and reduce them to obtain the total number of messages in a given period (say one week), roughly as in the sketch below. But what about the first task? When I do the data masking there is no reduce operation, so is this task not something for Hadoop MapReduce? I want to do it in parallel, but I can't really see how to apply Hadoop MapReduce to the first task. The number of data files I need to process is really large, which makes me think of using Hadoop MapReduce anyway.
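For the second task, this is roughly what I have in mind (just an untested sketch based on the tutorials; the bucketing by calendar week and the class and field names are my own guesses):

```java
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WeeklyMessageCount {

    // Mapper: "count; user1; user2; date;" -> ("user1|user2|year-week", count)
    public static class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final SimpleDateFormat dateFormat = new SimpleDateFormat("M/d/yyyy");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(";");
            if (fields.length < 4) {
                return; // skip malformed lines
            }
            int count;
            String week;
            try {
                count = Integer.parseInt(fields[0].trim());
                Calendar cal = Calendar.getInstance();
                cal.setTime(dateFormat.parse(fields[3].trim()));
                week = cal.get(Calendar.YEAR) + "-W" + cal.get(Calendar.WEEK_OF_YEAR);
            } catch (Exception e) {
                return; // skip lines with an unparsable count or date
            }
            String userPair = fields[1].trim() + "|" + fields[2].trim();
            context.write(new Text(userPair + "|" + week), new IntWritable(count));
        }
    }

    // Reducer: sums the message counts for each (user pair, week) key
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```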

Thanks for your comments!

P.S.: The question can be generalized to "for which types of tasks is Hadoop MapReduce best suited?"

1
Why can't you have the map part do the transformation? The map part of Map/Reduce can be used to transform data as well, so your task becomes: Map -> record to "#; encoded user 1; encoded user 2; date". Reduce will then transform that to get the stats you need. – abatyuk
You are right, I can do only the map part. Is it a good approach if I do only the map part, store the masked record files back into HDFS, and then read them again to run the reduce jobs on top of the masked records? Main memory can't hold the dataset as a whole. – Bob
As far as I understand, the reducers wait until the mappers have done their job, which in my case means that all data files have to be transformed first. Can you please explain the workflow you are thinking of? – Bob

1 Answer

2
votes

The first task is a perfect fit for a map-only job. MapReduce in general is suitable for sorting, mapping (applying some function to the data) and reducing data.
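A minimal sketch of such a map-only job could look like the following (it assumes the semicolon-separated format from your question and hex-encodes the SHA-256 digest; the class name and field handling are placeholders, not tested):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: reads "count; user1; user2; timestamp;" lines and
// replaces both user names with their SHA-256 hashes.
public class MaskingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Hex-encoded SHA-256 of a single string
    private static String sha256(String s) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(";");
        if (fields.length < 4) {
            return; // skip malformed lines
        }
        String masked = fields[0].trim() + "; "
                + sha256(fields[1].trim()) + "; "
                + sha256(fields[2].trim()) + "; "
                + fields[3].trim() + ";";
        // NullWritable key: only the masked record ends up in the output file
        context.write(NullWritable.get(), new Text(masked));
    }
}
```

In the driver you would call job.setNumReduceTasks(0), so the mapper output is written straight back to HDFS without any shuffle or reduce phase.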

So your problem fits into MR very well.

MapReduce is not suitable if you need strong communication between tasks or iteration-heavy workloads such as graph algorithms. For those, BSP (Bulk Synchronous Parallel) is a better fit; you can choose between Apache Hama and Apache Giraph. Giraph is geared mainly towards graph processing, while Hama is a pure BSP framework that also offers a module for graph processing.