The Name Node holds the metadata of the entire cluster: the details of each folder and file, replication factors, block names, and so on. It also keeps the location of the blocks for each file in memory; this information is constructed from the Block Reports sent by the Data Nodes.
Data Nodes store the following information for each block:
- Actual data stored in the block
- Metadata for the data stored in the block, mainly checksums for that data.
They periodically send heart beats and block reports to the Name Node.
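The per-block checksum metadata can be illustrated with a minimal sketch. It assumes one checksum per 512-byte chunk (the default of dfs.bytes-per-checksum) and uses Python's zlib.crc32 as a stand-in for the CRC32C variant that HDFS uses by default. The function names are hypothetical and are not part of any Hadoop API.

```python
import zlib
from typing import List

# Assumption: 512 bytes per checksum, matching the dfs.bytes-per-checksum default.
BYTES_PER_CHECKSUM = 512


def chunk_checksums(block_data: bytes, chunk_size: int = BYTES_PER_CHECKSUM) -> List[int]:
    """Return one CRC32 value per chunk of the block (stand-in for CRC32C)."""
    return [
        zlib.crc32(block_data[offset:offset + chunk_size])
        for offset in range(0, len(block_data), chunk_size)
    ]


def verify_block(block_data: bytes, stored: List[int],
                 chunk_size: int = BYTES_PER_CHECKSUM) -> bool:
    """Recompute the checksums and compare them with the stored metadata."""
    return chunk_checksums(block_data, chunk_size) == stored


block = b"x" * 1300              # three chunks: 512 + 512 + 276 bytes
meta = chunk_checksums(block)    # what the block's metadata would hold
corrupted = b"y" + block[1:]     # same length, first byte changed
```

Here verify_block(block, meta) succeeds, while verify_block(corrupted, meta) fails because the checksum of the first chunk no longer matches.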
Heart Beat:
- Interval of heart beat reports is determined by the configuration parameter dfs.heartbeat.interval (in hdfs-site.xml). By default this is set to 3 seconds.
- Some of the information contained in the Heart beat is:
- Registration: Data node registration information
- Capacity: Total storage capacity available at Data Node
- dfsUsed: Storage used by HDFS
- remaining: Remaining storage available for HDFS
- blockPoolUsed: Storage used by the block pool
- xmitsInProgress: Number of transfers from this Data Node to others
- xceiverCount: Number of active transceiver threads
- cacheCapacity: Total cache capacity available at Data Node
- cacheUsed: Amount of cache used
- This information is used by the Name Node in the following ways:
- Health of the Data Node: Should this data node be marked as dead or alive?
- Registration of new Data Node: If this is a newly added Data Node, its information is registered
- Update the metrics of the Data Node: The information sent in the heart beat is used for updating the metrics of the node
- Issue commands to the Data Node: The Name Node can issue the following commands to the Data Node, based on the information received in the heart beat: BlockRecoveryCommand (to recover specified blocks), BlockCommand (for transferring blocks to another Data Node or invalidating certain blocks), Cache/Uncache (commands for caching / uncaching blocks)
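How the Name Node might use heart beats for registration, metrics, and liveness can be sketched as follows. This is a simplified illustration, not the Hadoop implementation: the class and field names are hypothetical, and the expiry formula (twice the recheck interval plus ten heart beat intervals, with a recheck interval of 5 minutes) is an assumption based on the usual HDFS defaults rather than something stated above.

```python
from dataclasses import dataclass
from typing import Dict

HEARTBEAT_INTERVAL_S = 3    # dfs.heartbeat.interval default
RECHECK_INTERVAL_S = 300    # assumed default of dfs.namenode.heartbeat.recheck-interval
# Assumed expiry rule: 2 * recheck interval + 10 * heart beat interval.
EXPIRE_AFTER_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S


@dataclass
class Heartbeat:
    """A small subset of the fields listed above (hypothetical structure)."""
    node_id: str
    capacity: int
    dfs_used: int
    remaining: int
    xceiver_count: int = 0


class HeartbeatMonitor:
    def __init__(self, expire_after_s: float = EXPIRE_AFTER_S) -> None:
        self.expire_after_s = expire_after_s
        self.last_seen: Dict[str, float] = {}
        self.metrics: Dict[str, Heartbeat] = {}

    def receive(self, hb: Heartbeat, now: float) -> bool:
        """Record a heart beat; return True if the node is newly registered."""
        is_new = hb.node_id not in self.last_seen
        self.last_seen[hb.node_id] = now
        self.metrics[hb.node_id] = hb   # latest metrics replace the old ones
        return is_new

    def is_alive(self, node_id: str, now: float) -> bool:
        """A node is alive if its last heart beat is within the expiry window."""
        seen = self.last_seen.get(node_id)
        return seen is not None and now - seen <= self.expire_after_s


monitor = HeartbeatMonitor()
first = monitor.receive(Heartbeat("dn1", 1000, 200, 800), now=0.0)
second = monitor.receive(Heartbeat("dn1", 1000, 300, 700), now=3.0)
```

Timestamps are passed in explicitly so the liveness check stays deterministic. A real monitor would read a monotonic clock and would also queue commands to send back in the heart beat response.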
Block Reports:
- Interval of block reports is determined by the configuration parameter dfs.blockreport.intervalMsec (in hdfs-site.xml). By default this is set to 21600000 milliseconds (6 hours).
- Some of the information contained in the block report is:
- Registration: Data node registration information
- blocks: Information about the blocks, which contains: block ID, block length, block generation stamp, and the state of the block replica (e.g. the replica is finalized or waiting to be recovered)
- This information is used by the Name Node for:
- Processing the first block report: If this is the first report from a newly registered Data Node, the Name Node just adds all the valid replicas. It ignores all the invalid blocks until the next block report.
- Updating the information about blocks: The (Data Node -> Blocks) map is updated in the Name Node. The new block report is compared with the old report, and the information about successful blocks, corrupted blocks, invalidated blocks etc. is updated
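The comparison of a new block report with the previous one can be sketched as a set difference over block IDs. This is a deliberately reduced model: it ignores block length, generation stamp, and replica state, and the BlockMap class and its methods are hypothetical names, not Hadoop classes.

```python
from typing import Dict, Set, Tuple


class BlockMap:
    """Toy (Data Node -> Blocks) map kept in memory by the Name Node."""

    def __init__(self) -> None:
        self.blocks_by_node: Dict[str, Set[int]] = {}

    def process_report(self, node_id: str, reported: Set[int]) -> Tuple[Set[int], Set[int]]:
        """Replace the node's entry and return (added, removed) block IDs."""
        previous = self.blocks_by_node.get(node_id, set())
        added = reported - previous      # replicas the node did not report before
        removed = previous - reported    # replicas missing from the new report
        self.blocks_by_node[node_id] = set(reported)
        return added, removed

    def locations(self, block_id: int) -> Set[str]:
        """Return the Data Nodes currently known to hold the block."""
        return {node for node, blocks in self.blocks_by_node.items()
                if block_id in blocks}


block_map = BlockMap()
first_report = block_map.process_report("dn1", {1, 2, 3})
second_report = block_map.process_report("dn1", {2, 3, 4})
block_map.process_report("dn2", {3})
```

The first report from a node adds everything it lists, since there is no earlier report to compare against. Later reports yield only the differences, which is the information needed to update block locations.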