39
votes

This is a conceptual question involving Hadoop/HDFS. Lets say you have a file containing 1 billion lines. And for the sake of simplicity, lets consider that each line is of the form <k,v> where k is the offset of the line from the beginning and value is the content of the line.

Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on that split? or do we have to write a partitioning function that does the N splits and run each map task on the split generated?

All i want to know is, whether the splits are done internally or do we have to split the data manually?

More specifically, each time the map() function is called what are its Key key and Value val parameters?

Thanks, Deepak

10

10 Answers

25
votes

The InputFormat is responsible to provide the splits.

In general, if you have n nodes, the HDFS will distribute the file over all these n nodes. If you start a job, there will be n mappers by default. Thanks to Hadoop, the mapper on a machine will process the part of the data that is stored on this node. I think this is called Rack awareness.

So to make a long story short: Upload the data in the HDFS and start a MR Job. Hadoop will care for the optimised execution.

14
votes

Files are split into HDFS blocks and the blocks are replicated. Hadoop assigns a node for a split based on data locality principle. Hadoop will try to execute the mapper on the nodes where the block resides. Because of replication, there are multiple such nodes hosting the same block.

In case the nodes are not available, Hadoop will try to pick a node that is closest to the node that hosts the data block. It could pick another node in the same rack, for example. A node may not be available for various reasons; all the map slots may be under use or the node may simply be down.

10
votes

Fortunately everything will be taken care by framework.

MapReduce data processing is driven by this concept of input splits. The number of input splits that are calculated for a specific application determines the number of mapper tasks.

The number of maps is usually driven by the number of DFS blocks in the input files.

Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits are processed locally.

If data locality can't be achieved due to input splits crossing boundaries of data nodes, some data will be transferred from one Data node to other Data node.

Assume that there is 128 MB block and last record did not fit in Block a and spreads in Block b, then data in Block b will be copied to node having Block a

Have a look at this diagram.

enter image description here

Have a look at related quesitons

About Hadoop/HDFS file splitting

How does Hadoop process records split across block boundaries?

7
votes

I think what Deepak was asking was more about how the input for each call of the map function is determined, rather than the data present on each map node. I am saying this based on the second part of the question: More specifically, each time the map() function is called what are its Key key and Value val parameters?

Actually, the same question brought me here, and had i been an experienced hadoop developer, i may have interpreted it like the answers above.

To answer the question,

the file at a given map node is split, based on the value we set for InputFormat. (this is done in java using setInputFormat()! )

An example:

conf.setInputFormat(TextInputFormat.class); Here, by passing TextInputFormat to the setInputFormat function, we are telling hadoop to treat each line of the input file at the map node as the input to the map function. Linefeed or carriage-return are used to signal end of line. more info at TextInputFormat!

In this example: Keys are the position in the file, and values are the line of text.

Hope this helps.

4
votes

Difference between block size and input split size.

Input Split is logical split of your data, basically used during data processing in MapReduce program or other processing techniques. Input Split size is user defined value and Hadoop Developer can choose split size based on the size of data(How much data you are processing).

Input Split is basically used to control number of Mapper in MapReduce program. If you have not defined input split size in MapReduce program then default HDFS block split will be considered as input split during the data processing.

Example:

Suppose you have a file of 100MB and HDFS default block configuration is 64MB then it will chopped in 2 split and occupy two HDFS blocks. Now you have a MapReduce program to process this data but you have not specified input split then based on the number of blocks(2 block) will be considered as input split for the MapReduce processing and two mapper will get assigned for this job. But suppose, you have specified the split size(say 100MB) in your MapReduce program then both blocks(2 block) will be considered as a single split for the MapReduce processing and one Mapper will get assigned for this job.

Now suppose, you have specified the split size(say 25MB) in your MapReduce program then there will be 4 input split for the MapReduce program and 4 Mapper will get assigned for the job.

Conclusion:

  1. Input Split is a logical division of the input data while HDFS block is a physical division of data.
  2. HDFS default block size is a default split size if input split is not specified through code.
  3. Split is user defined and user can control split size in his MapReduce program.
  4. One split can be mapping to multiple blocks and there can be multiple split of one block.
  5. The number of map tasks (Mapper) are equal to the number of input splits.

Source : https://hadoopjournal.wordpress.com/2015/06/30/mapreduce-input-split-versus-hdfs-blocks/

1
votes

FileInputFormat is the abstract class which defines how the input files are read and spilt up. FileInputFormat provides following functionalites: 1. select files/objects that should be used as input 2. Defines inputsplits that breaks a file into task.

As per hadoopp basic functionality, if there are n splits then there will be n mapper.

1
votes

When a Hadoop job is run, it split input files into chunks and assign each split to a mapper to process; this is called InputSplit.

0
votes

The short answer is the InputFormat take care of the split of the file.

The way that I approach this question is by looking at its default TextInputFormat class:

All InputFormat classes are subclass of FileInputFormat, which take care of the split.

Specifically, FileInputFormat's getSplit function generate a List of InputSplit, from the List of files defined in JobContext. The split is based on the size of bytes, whose Min and Max could be defined arbitrarily in project xml file.

0
votes

There is a seperate map reduce job that splits the files into blocks. Use FileInputFormat for large files and CombineFileInput Format for smaller ones. You can also check the whether the input can be split into blocks by issplittable method. Each block is then fed to a data node where a map reduce job runs for further analysis. the size of a block would depend on the size that you have mentioned in mapred.max.split.size parameter.

0
votes

FileInputFormat.addInputPath(job, new Path(args[ 0])); or

conf.setInputFormat(TextInputFormat.class);

class FileInputFormat funcation addInputPath ,setInputFormat take care of inputsplit, also this code defines the number of mappers get created. we can say inputsplit and number of mappers is directly proportion to number of blocks used for storing input file on HDFS.

Ex. if we have input file with size 74 Mb , this file stored on HDFS in two blocks (64 MB and 10 Mb). so inputsplit for this file is two and two mapper instances get created for reading this input file.