
My question is this: I have a text file with 100 words in it, separated by spaces, and I need to write a word-count program.

So, when the file is split into HDFS blocks, how can we be sure that the split happens only at the end of a word?

For example, suppose the 50th word in the text file is "Hadoop". What if, while splitting into 64 MB blocks, the current block reaches 64 MB in the middle of that word, so that one block ends with 'Had' and the next block starts with 'oop'?

Sorry if the question sounds silly, but please provide an answer. Thanks.


1 Answer


The answer to this is the InputSplit.

HDFS does not know the content of the file. When it stores data across multiple blocks, the last record of each block may be broken: the first part of the record ends up in one block and the remaining part in another.
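This byte-level breaking can be illustrated with a small sketch. The tiny block size and sample text here are invented purely for illustration (real HDFS blocks default to 128 MB); the point is that blocks are fixed-size byte ranges with no awareness of word boundaries:

```python
# Illustrative only: HDFS cuts a file into fixed-size byte ranges,
# so a split can land in the middle of a word.
text = "the quick brown Hadoop elephant"
block_size = 19  # toy size for illustration; real HDFS defaults to 128 MB

# Slice the text into fixed-size "blocks", exactly as HDFS slices bytes.
blocks = [text[i:i + block_size] for i in range(0, len(text), block_size)]

print(blocks)
# The word "Hadoop" is cut: the first block ends with 'Had'
# and the second block begins with 'oop'.
```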

To solve this problem, MapReduce uses the concept of Input Splits.

A 'block' is the physical division of the data, 128 MB by default (64 MB in older Hadoop versions), distributed across multiple DataNodes, whereas an 'Input Split' is a logical division of the data.

When a MapReduce program runs, the number of mappers depends on the number of input splits, and during processing an input split includes the location of the next block, which contains the rest of the broken record.

For example, suppose there are three HDFS blocks and the last record of Block-1 spills over into Block-2. In that case, the input split covering Block-1 will use the location of Block-2 to retrieve the rest of the broken record.
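The logic a record reader uses to repair this can be sketched in plain Python. This is an illustrative simulation of the idea behind Hadoop's line-oriented record reading, not the actual Hadoop API: every reader except the first skips the partial record at the start of its split (the previous reader owns it), and every reader finishes the last record it starts even if that means reading past the end of its split:

```python
def read_records(data: bytes, split_start: int, split_end: int) -> list:
    """Simulate line-record reading over one input split.

    Records are newline-terminated. A reader whose split does not start
    at byte 0 skips the broken first record; a reader that starts a
    record before split_end reads past the boundary to complete it.
    """
    pos = split_start
    if split_start > 0:
        # Skip the partial first record -- the previous split's reader
        # will have read it to completion.
        nl = data.find(b"\n", split_start)
        if nl == -1:
            return []
        pos = nl + 1

    records = []
    while pos < len(data) and pos < split_end:
        nl = data.find(b"\n", pos)
        end = len(data) if nl == -1 else nl
        records.append(data[pos:end].decode())
        pos = end + 1
    return records


data = b"alpha\nbeta\ngamma\ndelta\n"
# Split the file at byte 8 -- in the middle of the record "beta".
print(read_records(data, 0, 8))          # first split finishes "beta"
print(read_records(data, 8, len(data)))  # second split skips the fragment
```

Every record is returned exactly once, with no duplication and no loss, even though the split boundary fell mid-record. This mirrors why a word count over the question's file is correct despite 'Had' and 'oop' living in different blocks.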

hadoopchannel