3
votes

While reading the book Hadoop: The Definitive Guide, I came across this page with the following line:

The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

I am struggling to understand how this works. Let's say I copy a 1 GB file onto an 8-node cluster with a replication factor of 3. So each datanode will have 1 block, and these blocks will be replicated on other nodes, bringing the total number of blocks on each node effectively to 3. Now the namenode is supposed to keep an index containing the location of each block. But according to the text, the namenode does not store block locations persistently, so how are they reconstructed after the cluster is shut down and restarted? There would be no way of telling which block belongs to which file. Can someone please explain this to me?

2 Answers

3
votes

The namenode does preserve some state about the files (name, path, size, block size, block IDs, etc.), just not the physical location of the blocks.

When the datanodes start up, each one effectively walks its DFS data directory tree, discovering all the block files it holds, and once complete, reports to the namenode the blocks that it hosts.
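To make the startup scan concrete, here is a minimal sketch of it, not the actual datanode code. It assumes block data files are named blk_<id> and that their checksum companions end in .meta; the function names and the demo directory layout are made up for illustration.

```python
import os
import re
import tempfile

# Assumption: block data files are named blk_<id>; anything else
# (for example blk_<id>_<stamp>.meta checksum files) is ignored.
BLOCK_FILE = re.compile(r"^blk_(-?\d+)$")

def scan_blocks(data_dir):
    """Walk data_dir and return the sorted IDs of all block files found."""
    block_ids = []
    for _root, _dirs, files in os.walk(data_dir):
        for name in files:
            match = BLOCK_FILE.match(name)
            if match:
                block_ids.append(int(match.group(1)))
    return sorted(block_ids)

def make_demo_dir():
    """Create a throwaway tree that loosely mimics a datanode data directory."""
    root = tempfile.mkdtemp()
    sub = os.path.join(root, "subdir0")
    os.makedirs(sub)
    for name in ("blk_1001", "blk_1001_1.meta", "blk_1002", "VERSION"):
        open(os.path.join(sub, name), "w").close()
    return root

if __name__ == "__main__":
    # The list returned here plays the role of the block report.
    print(scan_blocks(make_demo_dir()))
```

Note that the scan yields only block IDs. The datanode does not need to know which file a block belongs to; that part lives on the namenode.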

The namenode combines the block reports from each datanode with the file-to-block-ID metadata it already has, and from that builds up a map of files to block locations.
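This is the step that answers the "which block belongs to which file" worry, so a small sketch may help. It is a toy model under stated assumptions, not namenode code: the paths, block IDs, datanode names, and both function names are hypothetical.

```python
from collections import defaultdict

# Persisted by the namenode: file path -> ordered block IDs (no locations).
file_to_blocks = {"/data/big.bin": [1001, 1002], "/data/small.txt": [1003]}

def build_block_map(block_reports):
    """Invert per-datanode reports into block ID -> set of datanodes."""
    block_map = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            block_map[block_id].add(datanode)
    return block_map

def locate_file(path, block_map):
    """Return [(block_id, sorted datanodes)] for each block of the file."""
    return [(b, sorted(block_map.get(b, ()))) for b in file_to_blocks[path]]

# What three datanodes might report after scanning their disks.
reports = {"dn1": [1001, 1003], "dn2": [1001, 1002], "dn3": [1002, 1003]}
block_map = build_block_map(reports)

if __name__ == "__main__":
    print(locate_file("/data/big.bin", block_map))
```

The file-to-block-ID half survives a restart on disk; only the block-ID-to-datanode half is rebuilt from the reports.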

This is one of the reasons it sometimes takes a few minutes to come out of safe mode when the cluster first starts up: if you have lots of files, it can take a while for each datanode to walk its directories and discover the blocks it hosts.
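Roughly speaking, the namenode stays in safe mode until enough of the blocks it knows about have been reported. The check below is a simplified sketch of that idea. The default values are assumptions modeled on my understanding of the dfs.namenode.safemode.threshold-pct and dfs.namenode.replication.min settings, so verify them against your Hadoop version; the function name is hypothetical.

```python
def can_leave_safe_mode(block_map, total_blocks, min_replicas=1, threshold=0.999):
    """True once enough blocks have at least min_replicas reported locations.

    block_map:    block ID -> set of datanodes that reported it so far
    total_blocks: number of blocks listed in the persisted file metadata
    """
    if total_blocks == 0:
        return True
    # Simplification: assumes every reported block belongs to some file.
    safe = sum(1 for nodes in block_map.values() if len(nodes) >= min_replicas)
    return safe / total_blocks >= threshold

if __name__ == "__main__":
    # Only one of two known blocks has been reported so far.
    print(can_leave_safe_mode({1001: {"dn1"}}, total_blocks=2))
```

The more blocks there are, the longer the reports take to arrive, which is why a cluster with many files can sit in safe mode for a while after startup.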

-1
votes

Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored.

An fsimage file does not record the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up to date.
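The periodic part can be pictured as replacing everything the namenode believes about one datanode with that datanode's latest full list. The sketch below is a toy illustration of that idea only; the function name, block IDs, and datanode names are hypothetical, and the real namenode bookkeeping is considerably more involved.

```python
def apply_full_report(block_map, datanode, reported_ids):
    """Replace everything known about one datanode with its latest report.

    block_map maps block ID -> set of datanodes. A block that loses its last
    location stays in the map with an empty set, so it can be seen as missing.
    """
    reported = set(reported_ids)
    # Drop the datanode from blocks it no longer reports.
    for block_id, nodes in block_map.items():
        if block_id not in reported:
            nodes.discard(datanode)
    # Add it to every block it reports now.
    for block_id in reported:
        block_map.setdefault(block_id, set()).add(datanode)
    return block_map

# In-memory state before dn2 sends a fresh report.
locations = {1001: {"dn1", "dn2"}, 1002: {"dn2"}}
# dn2 no longer holds 1001 and has gained 1004.
apply_full_report(locations, "dn2", [1002, 1004])

if __name__ == "__main__":
    print(locations)
```

Because the map is always rebuilt from what datanodes actually hold, there is nothing to gain from writing it into the fsimage.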