2 votes

I am trying to load our data into Hadoop HDFS. After some test runs, when I check the Hadoop web UI, I realise that a lot of space is consumed under the title "Non-DFS used". In fact, "Non-DFS used" is more than "DFS used", so almost half the cluster is consumed by Non-DFS data.

Even after reformatting the namenode and restarting, this "Non-DFS" space is not freed up.

Also I am not able to find the directory under which this "Non-DFS" data is stored, so that I can manually delete those files.

I have read many threads online from people stuck on this exact issue, but none of them got a definitive answer.

Is it really so difficult to empty this "Non-DFS" space? Or should I not be deleting it at all? How can I free up this space?


2 Answers

4 votes

In HDFS, "Non-DFS used" is the storage on a datanode that is not occupied by HDFS data.

Look at the datanode's hdfs-site.xml: the directories set in the property dfs.data.dir or dfs.datanode.data.dir are used for DFS. All other used storage on the datanode is counted as Non-DFS storage.
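
For reference, you can check which directories a datanode is actually using for DFS by reading them out of hdfs-site.xml. A minimal sketch in Python (the config path is an assumption; adjust it for your install):

    # Print the DFS data directories configured in hdfs-site.xml.
    # /etc/hadoop/conf/hdfs-site.xml is an assumed location; Ambari/CDH layouts differ.
    import xml.etree.ElementTree as ET

    HDFS_SITE = "/etc/hadoop/conf/hdfs-site.xml"

    tree = ET.parse(HDFS_SITE)
    for prop in tree.getroot().findall("property"):
        name = prop.findtext("name")
        if name in ("dfs.data.dir", "dfs.datanode.data.dir"):
            # The value is a comma-separated list of local directories holding DFS blocks.
            print(name, "=", prop.findtext("value"))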

You can free it up by deleting any unwanted files from the datanode machine, such as Hadoop logs and any non-Hadoop-related files (other data on the disk). It cannot be done using any Hadoop commands.
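
As a rough illustration, a sketch like the one below can help locate the biggest non-HDFS consumers on a datanode disk before deleting anything by hand. The mount point and DFS data directory are assumptions; substitute your own values:

    # Report top-level directory sizes on a datanode disk, skipping the DFS data
    # directory, to see where the Non-DFS space is going (logs, temp files, etc.).
    import os

    MOUNT = "/data"                      # assumed mount point of the datanode disk
    DFS_DIR = "/data/hadoop/hdfs/data"   # assumed dfs.datanode.data.dir

    def dir_size(path):
        total = 0
        for root, _, files in os.walk(path, onerror=lambda e: None):
            if root.startswith(DFS_DIR):
                continue  # blocks under the DFS dir count as "DFS used", not Non-DFS
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # file may have vanished or be unreadable
        return total

    for entry in sorted(os.listdir(MOUNT)):
        path = os.path.join(MOUNT, entry)
        if os.path.isdir(path):
            print(f"{dir_size(path) / 1024**3:8.2f} GiB  {path}")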

Non-DFS used is calculated with the following formula:

Non DFS used = ( Total Disk Space - Reserved Space) - Remaining Space - DFS Used
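
To make the formula concrete, here is a small worked example in Python with made-up numbers (all values are hypothetical and in GB):

    # Non DFS used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used
    def non_dfs_used(total, reserved, remaining, dfs_used):
        return (total - reserved) - remaining - dfs_used

    # A 1000 GB disk with 50 GB reserved (dfs.datanode.du.reserved), 400 GB still
    # remaining and 300 GB of DFS blocks leaves 250 GB reported as Non-DFS used.
    print(non_dfs_used(total=1000, reserved=50, remaining=400, dfs_used=300))  # 250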

See this similar question:

What exactly Non DFS Used means?

0 votes

I had been facing the same issue for a while, and my non-DFS usage had reached about 13 TB! I tried many reconfigurations of YARN, Tez, MR2, etc., but with no success; the usage just kept increasing and my cluster usage reached almost 90%. This in turn led to a lot of vertex failures while running my scripts, and my repeated attempts at reconfiguring the system also failed.

What worked for me, though (funny story), was just a simple restart of all the datanodes from Ambari. It cut the non-DFS usage from 13 TB to just over 6 TB. My resource manager had been up for about 160 days, and I am guessing that restarting the datanodes might simply have cleared out the log files.