2 votes

I need to store a large file of about 10 TB on HDFS. What I need to understand is how HDFS will store this file. Say the replication factor for the cluster is 3 and I have a 10-node cluster with over 10 TB of disk space on each node, i.e. the total cluster capacity is over 100 TB.

Now, does HDFS choose three nodes at random and store the whole file on those three nodes? Is it really as simple as that? Please confirm.

Or does HDFS split the file, say into 10 splits of 1 TB each, and then store each split on 3 nodes chosen at random? Is splitting possible, and if so, is it enabled through a configuration setting? And if HDFS has to split a binary or text file, how does it split it? Simply by bytes?
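If it is configurable, I assume the relevant setting is the dfs.blocksize property in hdfs-site.xml, roughly like the sketch below (the 256 MB value is just an example):

    <!-- hdfs-site.xml (sketch): set the HDFS block size to 256 MB -->
    <property>
      <name>dfs.blocksize</name>
      <value>268435456</value>
    </property>

Is that the right knob, or does splitting work differently?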

Unless the format you're going to use is splittable, this is a bad idea. From HDFS's perspective it doesn't matter, but for MapReduce, if the format isn't splittable, only one mapper will be able to process said file. – Binary Nerd

1 Answer

8 votes

Yes, HDFS splits the file into blocks (128 MB by default), simply by byte offset; it does not look at the file's contents. Every block is replicated on 3 nodes. As a result you'll have 30 TB of data (10 TB of unique data times replication factor 3) spread roughly evenly over your 10 nodes.
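If you want to see where the blocks of a given file actually ended up, one way (a sketch, assuming the file was uploaded to /data/bigfile) is hdfs fsck:

    # Sketch: list every block of the file and the DataNodes holding its replicas
    hdfs fsck /data/bigfile -files -blocks -locations

For scale: with the default 128 MB block size, a 10 TB file works out to roughly 81,920 blocks, so with replication factor 3 the cluster stores about 245,760 block replicas (~30 TB in total).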