I want to make sure I am getting this concept right:
In Hadoop the Definite Guide it is stated that: "the goal while designing a file system is always to reduce the number of seeks in comparison to the amount of data to be transferred." In this statement the author is referring to the "seeks()" of Hadoop logical blocks, right?
I am thinking that no matter how big the Hadoop block size is (64MB or 128MB or bigger) the number of seeks of the physical blocks (which are usually 4KB or 8KB) that the underlying filesystem (e.g. ext3/fat) will have to perform will be the same no matter the size of Hadoop block size.
Example: To keep numbers simple, assume underlying file system block size is 1MB. We want to read a file of size 128MB. If the Hadoop block size is 64MB, the file occupies 2 blocks. When reading there are 128 seeks. if the Hadoop block size is increased to 128MB, the number of seeks performed by the files system is still 128. In the second case, Hadoop will perform 1 seek instead of 2.
Is my understanding correct?
If I am correct, a substantial performance improvement by increasing block size will only be observed only for very large files, right? I am thinking that in the case of files that are in the 1~GB size range, reducing the number of seeks from ~20 seeks (64MB block size) to ~10 seek (128MB block size) shouldn't make much of a difference, right?