
I hope we can get advice from the smart people here.

We have a Hadoop cluster with 5 DataNode machines (worker machines).

Our HDFS capacity is almost 80 TB, and we are at 98% used capacity!

For economic reasons we can't increase the HDFS capacity by adding disks to the DataNodes.

So we are thinking of decreasing the HDFS replication factor from 3 to 2.

Let's walk through the scenario: if we decrease the HDFS replication factor from 3 to 2, it means we keep only 2 copies of each block.

But here is the question - the third copy that was created under the previous replication factor of 3 still exists on the HDFS disks.

So how does HDFS know to delete that third copy? Is that something HDFS knows how to do on its own?

Or is there perhaps no way to delete the old copies that were created under the previous replication factor?


1 Answer


In general 3 is the recommended replication factor. If you need to, though, there's a command to change the replication factor of existing files in HDFS. Once a file's replication factor is lowered, the NameNode marks the extra replicas as over-replicated and deletes them automatically, so the space is reclaimed without any manual cleanup:

hdfs dfs -setrep -w <REPLICATION_FACTOR> <PATH>

The path can be a file or directory. So, to change the replication factor of all existing files from 3 to 2 you could use:

hdfs dfs -setrep -w 2 /
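If you only want to reclaim space under a particular directory rather than the whole filesystem, point the command at that path instead (the path below is just a made-up example):

hdfs dfs -setrep -w 2 /user/archive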

Note that -w will force the command to wait until the replication has changed for all files. With terabytes of data this will take a while.
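If you'd rather not block while all that data is re-replicated, you can omit -w; the command then returns immediately and the NameNode removes the excess replicas in the background:

hdfs dfs -setrep 2 /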

To check that the replication factor has changed, run hdfs fsck / and have a look at "Average block replication". It should drop from 3 to 2.
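For example, you can pull just that line out of the fsck summary:

hdfs fsck / | grep "Average block replication"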

Have a look at the command's docs for more details.

You can change the default replication factor, which will be used for newly created files, by setting dfs.replication in hdfs-site.xml.
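For example, to make 2 the default for files written from now on (this does not change existing files), add this property to hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>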