I am trying to understand the relationship between the HDFS file-system block size and the underlying physical file-system block size.
As per my understanding, HDFS is a virtual file-system that stores the actual data on the underlying physical file-system. The default HDFS block size in Hadoop 2 is 128 MB, whereas in most Linux file-systems the block size is 4 KB.
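To make my mental model concrete, here is a small Python sketch of what I have been looking at. It prints the block size of the local file-system and lists the block files a DataNode keeps on disk; the data-directory path /hadoop/dfs/data is only an assumed placeholder for whatever dfs.datanode.data.dir is set to on your cluster:

```python
import glob
import os

# Block size of the underlying local file-system (ext4 etc.), typically 4096 bytes.
print("local FS block size:", os.statvfs("/").f_bsize, "bytes")

# A DataNode keeps each HDFS block as an ordinary file named blk_<id> under its
# data directory. The path below is an assumption -- substitute the value of
# dfs.datanode.data.dir from your hdfs-site.xml.
data_dir = "/hadoop/dfs/data"
pattern = os.path.join(data_dir, "current", "**", "blk_*")
for path in glob.glob(pattern, recursive=True):
    if not path.endswith(".meta"):  # skip the checksum side-files
        print(path, os.path.getsize(path), "bytes")
```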
My questions:
Q1) When an HDFS block is written to the actual file-system, is it written to multiple blocks of the underlying file-system? That is, does a single HDFS block end up occupying 128 * 1024 KB / 4 KB = 32,768 blocks (spelled out in the sketch after the questions)?
Q2) If the above is correct, doesn't it involve a lot of disk-head seeks? Isn't that a time-consuming process? How does Hadoop make this efficient?
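Just to make the arithmetic in Q1 explicit, this is the calculation I have in mind (a trivial sketch using the default sizes mentioned above):

```python
# Number of 4 KB local-FS blocks needed to hold one full 128 MB HDFS block.
hdfs_block_bytes = 128 * 1024 * 1024   # 128 MB, the default HDFS block size in Hadoop 2
local_block_bytes = 4 * 1024           # 4 KB, a typical Linux file-system block size
print(hdfs_block_bytes // local_block_bytes)  # -> 32768
```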
Can anyone help me understand this?