How HDFS storage works, in brief:
Let's say replication factor = 3 (the default)
Data file size = 10GB (e.g. xyz.log)
HDFS will take 10 x 3 = 30GB to store that file
Depending on the command you use, you will see different values for the space the file occupies in HDFS (10GB vs 30GB)
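The arithmetic above can be sketched as a tiny shell helper. raw_usage_gb is a hypothetical name used only for illustration (it is not a Hadoop command), and it assumes whole-GB sizes and a single replication factor for the whole file:

```shell
#!/bin/sh
# raw_usage_gb: hypothetical helper, not part of Hadoop.
# Raw HDFS usage = logical file size x replication factor.
raw_usage_gb() {
    logical_gb=$1      # file size as the user sees it, in GB
    replication=$2     # replication factor (dfs.replication)
    echo $((logical_gb * replication))
}

# The example from the text: a 10GB file with replication factor 3.
raw_usage_gb 10 3   # prints 30
```

With replication factor 1 the two numbers are the same, which is why the 10GB vs 30GB difference only shows up on replicated data.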
If you are on a recent version of Hadoop, try the following command. In my case it works well on Hortonworks Data Platform (HDP) 2.3.* and above. It should also work on Cloudera's latest platform.
hadoop fs -count -q -h -v /path/to/directory
(-q = show quotas, -h = human readable values, -v = print a header line)
This command will show the following fields in the output.
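QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
Where
CONTENT_SIZE = real file size without replication (10GB) and
SPACE_QUOTA = the space quota set on the directory ("none" if no quota is set). Space quotas are charged against the replicated size, so when a quota is set, SPACE_QUOTA minus REMAINING_SPACE_QUOTA = space occupied in HDFS to save the file (30GB)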
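As a rough sketch of how to read these columns: an HDFS space quota counts every replica of every block, so on a directory that has a space quota set, the replicated usage is the difference between the SPACE_QUOTA and REMAINING_SPACE_QUOTA columns. The helper name used_space_bytes and the sample line below are made up for illustration; the sketch assumes the command is run without -h (sizes in bytes) and without -v (no header line).

```shell
#!/bin/sh
# used_space_bytes: hypothetical helper, not part of Hadoop.
# Takes one data line of `hadoop fs -count -q <path>` output and prints
# SPACE_QUOTA - REMAINING_SPACE_QUOTA, i.e. bytes charged including replicas.
used_space_bytes() {
    # Split the line on whitespace into the eight columns.
    set -- $1
    space_quota=$3
    remaining_space_quota=$4
    if [ "$space_quota" = "none" ]; then
        # Without a space quota the difference cannot be computed.
        echo "no space quota set"
        return 1
    fi
    echo $((space_quota - remaining_space_quota))
}

# Made-up sample line: 100 GiB space quota, 70 GiB remaining,
# one 10 GiB file stored with replication factor 3.
sample="none inf 107374182400 75161927680 1 1 10737418240 /path/to/directory"
used_space_bytes "$sample"   # prints 32212254720 (30 GiB)
```

On a real cluster the input would come from something like used_space_bytes "$(hadoop fs -count -q /path/to/directory)". If no space quota is set on the directory, hadoop fs -du -s -h /path/to/directory is the simpler route; on recent releases it prints both the plain size and the space consumed across all replicas.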
Notes:
Control the replication factor by modifying the "dfs.replication" property in the hdfs-site.xml file (under the conf/ dir of the default Hadoop installation directory). If you have a multi-node cluster, changing it through Ambari/Cloudera Manager is recommended.
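To check what the setting resolves to, something like the following sketch may help. show_replication is a hypothetical wrapper and /path/to/directory/xyz.log is a placeholder path; the Hadoop commands inside it (hdfs getconf -confKey and hadoop fs -stat %r) are standard, but the output depends on your cluster. Note that dfs.replication only applies to files written after the change; existing files keep their replication factor until it is changed explicitly, for example with hadoop fs -setrep.

```shell
#!/bin/sh
# show_replication: hypothetical wrapper around two standard Hadoop commands.
show_replication() {
    target=$1
    if command -v hdfs >/dev/null 2>&1; then
        # Default replication factor from the client-side configuration.
        hdfs getconf -confKey dfs.replication
        # Replication factor recorded for this particular file (%r).
        hadoop fs -stat %r "$target"
    else
        echo "hdfs client not found on PATH"
    fi
}

# Placeholder path; replace with a real HDFS file.
show_replication /path/to/directory/xyz.log
```

If the two numbers differ, the file was most likely written before the default was changed, or it was created with an explicit replication factor.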
There are other commands to check storage space, e.g. hdfs fsck and hadoop fs -du -s (the older hadoop fsck and hadoop dfs -dus forms are deprecated).