
I read about Hadoop's HDFS and learned that Hadoop is designed to process a small number of large files rather than a large number of small files.

The reason given is that with a large number of small files, the Namenode's memory is quickly eaten away. I am having difficulty understanding this argument.

Consider the following scenario:

1000 small files, each 128 MB in size (the same as the HDFS block size).

So this would mean 1000 entries in the Namenode's memory holding this block information.

Now, consider the following scenario:

a single big file of size 128 MB * 1000, i.e. 1000 blocks.

Won't the Namenode also have 1000 entries for this single big file?

Is the conclusion correct that in both cases the Namenode holds the same number of entries in memory for the files' block information? If so, how is Hadoop more efficient with a small number of large files than with a large number of small files?

Can anyone help me understand this?


1 Answer


Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies ~150 bytes.

Case 1:

Number of Files = 1000
Number of Blocks per file = 1
Total Number of Blocks = 1000 (Number of Files * Number of Blocks per file)
Total number of objects in Namenode's namespace = 2000 (Number of Files + Total Number of Blocks)
Total Namenode Memory Used = 2000 * 150 bytes

Case 2:

Number of Files = 1
Number of Blocks per file = 1000
Total Number of Blocks = 1000 (Number of Files * Number of Blocks per file)
Total number of objects in Namenode's namespace = 1001 (Number of Files + Total Number of Blocks)
Total Namenode Memory Used = 1001 * 150 bytes

In both cases, the total size occupied by the data is the same. But in the first scenario roughly 300 KB of the Namenode's memory is used, whereas only about 150.15 KB is used in the second scenario.
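
To make the arithmetic concrete, here is a minimal Java sketch (the class and method names are illustrative only, not part of HDFS) that reproduces the two estimates above, assuming the ~150 bytes per namespace object rule of thumb:

    // Hypothetical helper that reproduces the arithmetic above, assuming
    // the rough rule of thumb of ~150 bytes per namenode object.
    public class NamenodeMemoryEstimate {

        // Each file, directory and block object costs roughly 150 bytes.
        static final long BYTES_PER_OBJECT = 150;

        // Namespace objects = one per file + one per block (directories ignored here).
        static long estimateBytes(long numFiles, long blocksPerFile) {
            long totalBlocks = numFiles * blocksPerFile;
            long totalObjects = numFiles + totalBlocks;
            return totalObjects * BYTES_PER_OBJECT;
        }

        public static void main(String[] args) {
            // Case 1: 1000 files, each exactly one 128 MB block -> 2000 objects
            System.out.println("Case 1: " + estimateBytes(1000, 1) + " bytes"); // 300000 (~300 KB)

            // Case 2: one file made up of 1000 blocks of 128 MB -> 1001 objects
            System.out.println("Case 2: " + estimateBytes(1, 1000) + " bytes"); // 150150 (~150.15 KB)
        }
    }

The gap only grows with scale: under this model, every additional small file adds two objects (one for the file plus one for its block), while an additional block of an already-existing large file adds just one.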