1
votes

Every time I use hadoop fs -ls /path_to_directory or hadoop fs -ls -h /path_to_directory, the result looks like the following:

drwxr-xr-x   - hadoop supergroup          0 2016-08-05 00:22 /user/hive-0.13.1/warehouse/t_b_city
drwxr-xr-x   - hadoop supergroup          0 2016-06-15 16:28 /user/hive-0.13.1/warehouse/t_b_mobile

The size of a directory in HDFS is always shown as 0, regardless of whether there are files within it.

Browsing from the web UI gives the same result:

drwxr-xr-x  hadoop  supergroup  0 B 0   0 B t_b_city
drwxr-xr-x  hadoop  supergroup  0 B 0   0 B t_b_mobile

However, there actually are files within those directories. When using the command hadoop fs -du -h /user/hive-0.13.1/warehouse/, the directory sizes are shown correctly:

385.5 K   /user/hive-0.13.1/warehouse/t_b_city
1.1 M     /user/hive-0.13.1/warehouse/t_b_mobile

Why do the hadoop fs -ls command and the HDFS web UI always show 0 for a directory?

Also, the hadoop fs -ls command usually finishes immediately, while hadoop fs -du takes some time to execute. It seems that hadoop fs -ls doesn't actually spend any time calculating the total size of a directory.
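The same distinction exists on a local filesystem, and can be sketched in Python (a local-filesystem analogy, not the HDFS API): the size recorded for a directory entry itself is unrelated to the total size of its contents, which has to be computed by walking the tree the way du does.

```python
import os
import tempfile

# Create a small tree: one directory containing a 1000-byte file.
root = tempfile.mkdtemp()
with open(os.path.join(root, "data.bin"), "wb") as f:
    f.write(b"\x00" * 1000)

# "ls"-style size: just the directory entry's own metadata size.
entry_size = os.path.getsize(root)

# "du"-style size: walk the tree and sum every file's size.
total = sum(
    os.path.getsize(os.path.join(dirpath, name))
    for dirpath, _, names in os.walk(root)
    for name in names
)

print(entry_size)  # small, platform-dependent, unrelated to contents
print(total)       # the real payload size of the files inside
```

Getting entry_size is a single metadata lookup (instant, like -ls); computing total requires visiting every file (slow on a big tree, like -du).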

2
When you run an ls -l command on Linux, the "size" displayed for directories is not related to the size of the files inside. So why would you expect HDFS to work differently? - Samson Scharfrichter
BTW, the NameNode stores the whole filesystem metadata in RAM, not on disk, therefore a directory entry requires zero bytes on disk. Linux filesystems, on the other hand, require a few disk segments to persist each directory (list of inodes, permissions, etc.). - Samson Scharfrichter
Thanks. It seems my understanding of the ls command has long been wrong. I took it for granted that ls would show the size of both files and directories. - Heyang Wang
Again, the size of a directory is the size of the directory object, just like the size of a file is the size of the file. Full stop. - Samson Scharfrichter

2 Answers

2
votes

It is working as designed. Hadoop is designed for big files, and one should not expect it to compute the total size of a directory every time hadoop fs -ls is run. If it worked the way you want, consider the point of view of someone who just wants to check whether a directory exists: they would end up waiting a long time simply because Hadoop was busy calculating folder sizes. Not so good.

1
votes

Try using a wildcard with the du option so that all the files under a database are listed with their sizes. The only catch is that you need multiple levels of wildcard pattern matching so that every level under the parent directory is covered.

hadoop fs -du -h /hive_warehouse/db/*/* > /home/list_du.txt
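To see why multiple wildcard levels are needed, here is a minimal local-filesystem sketch (hypothetical directory names, using Python's glob rather than HDFS globbing): a single * only matches the immediate children, so each extra level of nesting needs another /*.

```python
import glob
import os
import tempfile

# Hypothetical layout mimicking db/<table>/<partition>.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "db", "t_b_city", "part-0"))
os.makedirs(os.path.join(root, "db", "t_b_mobile", "part-0"))

one_level = glob.glob(os.path.join(root, "db", "*"))        # matches tables only
two_levels = glob.glob(os.path.join(root, "db", "*", "*"))  # reaches partitions

print(len(one_level))   # the two table directories
print(len(two_levels))  # the two partition directories beneath them
```

Each additional /* descends one more level, which is why the command above uses /*/* to cover everything two levels below the db directory.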