3
votes

I am totally new to Hadoop, though I understand the concept of MapReduce fairly well.

Most Hadoop tutorials start with the WordCount example, so I wrote a simple word count program, which worked perfectly well. But now I am trying to take a word count of a very large document (over 50 GB).

So my question to the Hadoop experts is: how will Hadoop handle the large file? Will it transfer copies of the whole file to each mapper, or will it automatically split it into blocks and transfer those blocks to the mappers?

Most of my experience with MapReduce comes from CouchDB, where a mapper handles one document at a time, but from what I have read about Hadoop, I wonder whether it is designed to handle many small files, a few large files, or both?

3
Check Chris's answer here: stackoverflow.com/questions/10719191/… – Amar

3 Answers

3
votes

Hadoop handles large files by splitting them into blocks of 64 MB or 128 MB (the default). These blocks are stored across the DataNodes, and the metadata is kept on the NameNode. When a MapReduce program runs, each block gets its own mapper; you cannot set the number of mappers directly. When the mappers are done, their output is sent to the reducers. The default number of reducers is one, but it can be changed, and that is where you get the output. Hadoop can also handle many small files, but it is preferable to group them into one large file for better performance: for example, if each small file is smaller than 64 MB, every file gets its own mapper. Hope this helps!
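
For reference, here is a minimal sketch of a WordCount driver against the Hadoop 2 (org.apache.hadoop.mapreduce) API; the class names and the reducer count of 4 are only illustrative. The point is that the number of reducers is set explicitly with setNumReduceTasks, while the number of mappers follows from the input splits:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Each mapper processes one input split (one HDFS block by default).
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Each reducer receives all counts for a given word and sums them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The number of mappers is derived from the input splits; only
            // the reducer count is set explicitly (the default is 1).
            job.setNumReduceTasks(4);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package this into a jar and run it with hadoop jar, passing the input and output paths as the two arguments.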

1
votes

Huge files in HDFS are already stored in a distributed fashion. When you run a MapReduce job, you have to specify an InputFormat for your file. If the InputFormat is splittable (i.e., the file is uncompressed, or compressed in a splittable format such as bz2), then it can be divided among as many mappers as you want. Most reasonable implementations ensure that every record in the file goes to some mapper and that no mapper gets the same record twice.

Copies of the file are not transferred - each mapper just reads the segment of the file that it is assigned. The data is either streamed over the network or, when possible, the map task is scheduled on the machine that stores that piece of the file. You can read as many input files as you want with Hadoop, as long as you specify an input format for each one.
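
As a concrete sketch (the paths and formats below are made up, and it assumes the org.apache.hadoop.mapreduce API), MultipleInputs lets you give each input path its own InputFormat in the driver:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputSetup {
        // Hypothetical paths: a plain (splittable) text file and a sequence
        // file. Splittable inputs are cut into splits, and each split is read
        // by one mapper, ideally on a node that already holds that block.
        public static void addInputs(Job job) {
            MultipleInputs.addInputPath(job, new Path("/data/big.txt"),
                    TextInputFormat.class);
            MultipleInputs.addInputPath(job, new Path("/data/archive.seq"),
                    SequenceFileInputFormat.class);
        }
    }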

0
votes

By default Hadoop creates one input split per file (or per block, for files larger than a block) and sends each split to a mapper. You can override this behaviour, but it's a little complicated, so I always just use a script to break up the files if they aren't already separated.
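
If you would rather change the file-to-mapper mapping in code than pre-split with a script, one option is CombineTextInputFormat, which packs many small files into fewer, larger splits. This is only a sketch and assumes a Hadoop release (2.x or later) that ships that class; the 256 MB cap is just an example value:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSetup {
        // Sketch: group many small files into combined splits of at most
        // ~256 MB, so one mapper handles several files instead of one each.
        public static void useCombinedSplits(Job job) {
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        }
    }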