
Suppose the size of a file XYZ is 68MB. With the default block size of 64MB, the blocks will be A (64MB) and B (4MB). In block B, the rest of the space is occupied by another data block.

So when the XYZ data file is processed, the data in blocks A and B will be processed. Since block B also contains data for another file, how does Hadoop know which part of block B to process?


1 Answer


If you have a file (XYZ) of 68MB and your block size is 64MB, the data will be split into 2 blocks. Block A will store 64MB of data, and Block B will store the remaining 4MB, after which the block is closed. There is no wasted space here, and no other file's data will ever be put into Block B.
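To make the arithmetic concrete, here is a minimal Python sketch of how a file's data maps onto HDFS blocks. The function name and interface are illustrative, not part of any Hadoop API; the key point it models is that the last block only holds the leftover bytes and is never shared with another file:

```python
def hdfs_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the HDFS blocks a file occupies.

    Every block except possibly the last is full-size; the last block
    holds only the remaining data. HDFS does not pad the last block,
    and it never stores another file's data in it.
    """
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(hdfs_blocks(68))   # [64, 4]  -> Block A = 64MB, Block B = 4MB
print(hdfs_blocks(130))  # [64, 64, 2]
```

Since each block belongs to exactly one file, the NameNode's file-to-block mapping is all MapReduce needs to find the right data.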

So while processing, MapReduce knows exactly which blocks to read for a given file. Of course, there are other considerations, such as input splits, which MapReduce takes into account while processing the blocks in order to figure out record boundaries.
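The point that input splits are not the same thing as blocks can be sketched as follows. This is a simplified model of Hadoop's `FileInputFormat` split calculation (the 1.1 "slop" factor comes from Hadoop's source; min/max split-size settings and multi-block files spread across hosts are ignored here for brevity):

```python
SPLIT_SLOP = 1.1  # Hadoop's tolerance for avoiding a tiny trailing split


def input_splits(file_len, split_size):
    """Sketch of FileInputFormat-style split computation.

    Emits (offset, length) pairs. If the leftover data is small
    relative to the split size, it is folded into the final split
    rather than becoming a separate tiny split.
    """
    splits = []
    remaining = file_len
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_len - remaining, split_size))
        remaining -= split_size
    if remaining > 0:
        splits.append((file_len - remaining, remaining))
    return splits

MB = 1024 * 1024
# 68MB file, 64MB split size: 68/64 = 1.0625 <= 1.1,
# so a single 68MB split covers both blocks A and B.
print(input_splits(68 * MB, 64 * MB))
```

So for the 68MB file in the question, both physical blocks may well be handed to a single mapper as one logical split, which is exactly why splits, not blocks, are what MapReduce reasons about when finding record boundaries.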