While reading the book Hadoop: The Definitive Guide, I came across this page with the following line:
The namenode also knows the datanodes on which all the blocks for a given file are located, however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
I am struggling to understand how this works. Let's say, that I copy a 1 GB file on an 8 node cluster with replication factor of 3. So each datanode will have 1 block and these blocks will be replicated on other nodes, bringing the total number of blocks on each node effectively to 3. Now the namenode is supposed to keep an index containing the location of each block. But according to the text, if the namenode does not store block locations persistently, how are they reconstructed after the cluster is shut down and restarted. There will be no way of telling which block belongs to which file. Can someone please explain this to me?