Number of map tasks and split size

Question

What I'm trying to do

I'm new to hadoop and I'm trying to perform MapReduce several times with a different number of mappers and reducers, and compare the execution time. The file size is about 1GB, and I'm not specifying the split size so it should be 64MB. I'm using a machine with 4 cores.

What I've done

The mapper and reducer are written in python. So, I'm using hadoop streaming. I specified the number of map tasks and reduce tasks by using '-D mapred.map.tasks=1 -D mapred.reduce.tasks=1'

Problem

Because I specified to used 1 map task and 1 reduce task, I expected to see just one attempt but I actually have 38 map attempts, and 1 reduce task. I read tutorials and SO questions similar to this problem, and some said that the default map task is 2, but I'm getting 38 map tasks. I also read that mapred.map.tasks only suggests the number and the number of map tasks is the number of split size. However, 1GB divided by 64MB is about 17, so I still don't understand why 38 map tasks were created.

1) If I want to use only 1 map task, do I have to set the input splits size to 1GB??

2) Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??

Tariq Tariq · Accepted Answer · 2013-11-27T20:26:19

Number of mappers is actually governed by the InputFormat you are using. Having said that, based on the type of data you are processing, InputFormat may vary. Normally, for the data stored as files in HDFS FileInputFormat, or a subclass, is used which works on the principle of MR split = HDFS block. However, this is not always true. Say you are processing a flat binary file. In such a case there is no delimiter(\n or something else) to represent the split boundary. What would you do in such a case? So, the above principle doesn't always work.

Consider another scenario wherein you are processing data stored in a DB, and not in HDFS. What will happen in such a case as there is no concept of 64MB block size when we talk about DBs?

The framework tries its best to carry out the computation in a manner as efficient as possible, which might involve creation of lesser/more number of mappers as specified/expected by you. So, in order to see how exactly mappers are getting created you need to look into the InputFormat you are using in your job. getSplits() method to be precise.

If I want to use only 1 map task, do I have to set the input splits size to 1GB??

You can override the isSplitable(FileSystem, Path) method of your InputFormat to ensure that the input files are not split-up and are processed as a whole by a single mapper.
Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??

It depends on availability. Mappers can run on multiple cores simultaneously. And a single core can run multiple mappers sequentially.

Number of map tasks and split size

What I'm trying to do

What I've done

Problem

2 Answers