12 votes

Does Hadoop split the data based on the number of mappers set in the program? That is, for a data set of 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?

Also, do all the mappers run simultaneously, or might some of them run serially?


5 Answers

1 vote

I just ran a sample MR program based on your question, and here are my findings.

Input: a file smaller than the block size.

Case 1: Number of mappers = 1. Result: 1 map task launched. The input split size (for the single mapper) was the same as the input file size.

Case 2: Number of mappers = 5. Result: 5 map tasks launched. The input split size for each mapper was one fifth of the input file size.

Case 3: Number of mappers = 10. Result: 10 map tasks launched. The input split size for each mapper was one tenth of the input file size.

So based on the above, for a file smaller than the block size:

split size = total input file size / number of map tasks launched

Note: Keep in mind, though, that the number of map tasks is decided based on the input splits.
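For reference, a minimal sketch (assuming the old org.apache.hadoop.mapred API; the input/output paths and job name are placeholders) of how the experiment above could be driven. setNumMapTasks() is only a hint to the framework:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitExperiment {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitExperiment.class);
        conf.setJobName("split-experiment");

        // Case 2 above: hint that we want 5 map tasks. For a file
        // smaller than one block, FileInputFormat then aims for
        // splits of roughly fileSize / 5 each.
        conf.setNumMapTasks(5);

        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // TextInputFormat produces (LongWritable offset, Text line)
        // pairs; the default identity mapper/reducer pass them through.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // The job counters report how many map tasks were launched.
        JobClient.runJob(conf);
    }
}
```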

26 votes

It's the other way round: the number of mappers is decided based on the number of splits. In reality it is the job of the InputFormat you are using to create the splits. You have no idea about the number of mappers until the number of splits has been decided. And splits are not always created based on the HDFS block size; it totally depends on the logic inside the getSplits() method of your InputFormat.
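To make that concrete, here is a rough sketch of the new-API contract; SketchInputFormat is a made-up name, and only the Hadoop types and the getSplits() signature are real:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public abstract class SketchInputFormat<K, V> extends InputFormat<K, V> {
    @Override
    public List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException {
        List<InputSplit> splits = new ArrayList<>();
        // Add one InputSplit per desired map task, using whatever logic
        // suits the data source (blocks, row ranges, ...). The framework
        // launches exactly splits.size() mappers; nothing else decides it.
        return splits;
    }
    // createRecordReader() is left abstract here; it turns one split
    // into the (key, value) records fed to one mapper.
}
```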

To understand this better, assume you are processing data stored in MySQL using MR. Since there is no concept of blocks in this case, the theory that splits are always created based on the HDFS block size fails. Right? What about split creation then? One possibility is to create splits based on ranges of rows in your MySQL table (and this is what DBInputFormat does, an input format for reading data from a relational database). Suppose you have 100 rows. Then you might have 5 splits of 20 rows each.
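A hedged sketch of such a setup; the table, columns, and connection details are invented for illustration, and UserRecord is a hypothetical DBWritable mapping one row:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbSplitSketch {
    // Hypothetical record type mapping one row of a "users" table.
    public static class UserRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            name = rs.getString("name");
        }
        public void write(PreparedStatement st) throws SQLException {
            st.setLong(1, id);
            st.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }
    }

    public static Job configure() throws IOException {
        Configuration conf = new Configuration();
        // Placeholder connection details.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost/mydb", "user", "password");
        Job job = Job.getInstance(conf, "db-split-sketch");
        job.setInputFormatClass(DBInputFormat.class);
        // DBInputFormat splits the table into row ranges, so 100 rows
        // with a hint of 5 map tasks gives roughly 20-row splits.
        DBInputFormat.setInput(job, UserRecord.class,
                "users", null /* conditions */, "id" /* orderBy */,
                "id", "name" /* columns */);
        return job;
    }
}
```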

It is only for InputFormats based on FileInputFormat (an InputFormat for handling data stored in files) that splits are created based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. If you have a file smaller than the HDFS block size, you'll get only 1 mapper for that file. If you want different behavior, you can use mapred.min.split.size. But it again depends solely on the getSplits() of your InputFormat.
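For example, a minimal sketch using the new-API setters that correspond to mapred.min.split.size and its max counterpart; the sizes here are arbitrary:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    public static void tune(Job job) {
        // New-API counterparts of mapred.min.split.size and
        // mapred.max.split.size (values here are illustrative).
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB
        // FileInputFormat then computes, per file:
        //   splitSize = max(minSize, min(maxSize, blockSize))
        // so a minimum above the block size yields splits spanning blocks.
    }
}
```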

There is a fundamental difference between an MR split and an HDFS block, and folks often get confused by this. A block is a physical piece of data, while a split is just a logical piece which is going to be fed to a mapper. A split does not contain the input data; it is just a reference to the data. Then what is a split? A split basically has 2 things: a length in bytes and a set of storage locations, which are just hostname strings.
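That is exactly what the new-API InputSplit abstract class exposes, shown here in simplified form:

```java
import java.io.IOException;

// Simplified from org.apache.hadoop.mapreduce.InputSplit: a split is
// just a length plus hostnames, never the data itself.
public abstract class InputSplit {
    // Size of the split in bytes; the framework uses it to sort
    // splits so the largest get processed first.
    public abstract long getLength() throws IOException, InterruptedException;

    // Hostnames where the underlying data lives, used to schedule
    // mappers close to the data.
    public abstract String[] getLocations() throws IOException, InterruptedException;
}
```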

Coming back to your question: Hadoop allows many more than 200 mappers. Having said that, it doesn't make much sense to have 200 mappers for just 500 MB of data. Always remember that when you talk about Hadoop, you are dealing with very large data; sending just 2.5 MB to each mapper would be overkill. And yes, if there are no free CPU slots, then some mappers may run after the completion of the current mappers. But the MR framework is very intelligent and tries its best to avoid this kind of situation. If the machine holding the data to be processed doesn't have any free CPU slots, the data will be moved to a nearby node where free slots are available, and it gets processed there.

HTH

6 votes

When you load data into the Hadoop Distributed File System (HDFS), Hadoop splits it according to the block size (64 MB by default) and distributes the blocks across the cluster. So your 500 MB will be split into 8 blocks (7 full 64 MB blocks plus one final 52 MB block). This does not depend on the number of mappers; it is a property of HDFS.

Now, when you run a MapReduce job, Hadoop by default assigns 1 mapper per block, so if you have 8 blocks, Hadoop will run 8 map tasks.

However, if you specify the number of mappers explicitly (i.e., 200), then the size of the data processed by each mapper depends on the distribution of the blocks and on which node your mapper is running. How many mappers actually process your data depends on your input splits.

In your case, with 500 MB split into 8 blocks, even if you specify 200 mappers, not all of them will process data, even if they are all initialized.

0 votes

If 200 mappers are running for 500 MB of data, then you need to check the size of each individual file. If a file is smaller than the block size (64 MB), then a map task will run for each such file.

Normally we merge the smaller files into one large file (larger than the block size); one alternative to that is sketched below.
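Instead of merging files by hand, CombineTextInputFormat packs many small files into each split. A minimal sketch, with an illustrative 128 MB cap on the combined split size:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesSketch {
    public static void configure(Job job) {
        // One split (and hence one mapper) now covers many small files
        // instead of one mapper per small file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound on the combined split size (illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```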

0 votes

No, it's not.

The number of mappers for a job is defined by the framework.

Have a look at the Apache MapReduce tutorial:

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
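As a sketch of that hint (the value is arbitrary; conf.setInt() is the concrete call, and the InputFormat's getSplits() still has the final say):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class MapCountHint {
    public static Configuration hint() {
        Configuration conf = new Configuration();
        // MRJobConfig.NUM_MAPS is "mapreduce.job.maps". It is only a
        // hint that some InputFormats consult when sizing splits.
        conf.setInt(MRJobConfig.NUM_MAPS, 100000);
        return conf;
    }
}
```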

Coming back to your queries:

That is, for a data set of 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?

If the DFS block and input split size is 128 MB, then a 500 MB file requires 4 mappers to process the data (three full 128 MB splits plus one final 116 MB split). The framework will run 4 map tasks in the above case.

Do all the mappers run simultaneously, or might some of them run serially?

All mappers run simultaneously. But the reducers will run only once the output from all mappers has been copied and is available to them.