
I want to get some clarification and confirmation of my understanding of blocks and input splits.

Kindly read the following and let me know if I am correct.

  1. When a file (say 1 GB in size) is copied from the local file system to HDFS using the "put" command, then, depending on the block size set in Hadoop's configuration files (say 128 MB), it is split into 8 blocks (1024 MB file / 128 MB block size) stored on 8 different data nodes. Also, depending on the replication factor (say 3), 2 additional copies of each block are placed on other data nodes (I understand data locality). All this block information (file name, block names, and the data nodes where they are stored) is kept in RAM on the name node. This information is not stored in the FSImage.

Is my understanding correct so far?

If I am correct so far, what does the FSImage on disk contain (what kind of content is in it)?

  2. When we run a MapReduce job on this dataset, the driver program will split the blocks of data stored on the data nodes into multiple "input splits" (the size is configured in the XML files). In this case, if each input split is 128 MB, then we have 8 input splits, and each split is assigned a mapper process.

Is my understanding correct?

Thanks much. Kind regards, nath


1 Answer


When a file (say 1 GB in size) is copied from the local file system to HDFS using the "put" command, then, depending on the block size set in Hadoop's configuration files (say 128 MB), it is split into 8 blocks (1024 MB file / 128 MB block size) stored on 8 different data nodes. Also, depending on the replication factor (say 3), 2 additional copies of each block are placed on other data nodes (I understand data locality).

=> That's correct.
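
As a quick sanity check on the arithmetic, here is a minimal sketch in Java (the 1 GB file size, 128 MB block size, and replication factor of 3 are the example values from the question, not defaults read from any configuration):

```java
public class HdfsBlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 1024;   // the 1 GB example file
        long blockSizeMb = 128;   // example block size (dfs.blocksize)
        int replication = 3;      // example replication factor (dfs.replication)

        // Number of blocks: ceiling of fileSize / blockSize
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        // Total block replicas stored across the cluster
        long replicas = blocks * replication;

        System.out.println("Blocks: " + blocks);       // 8
        System.out.println("Replicas: " + replicas);   // 24
        System.out.println("Raw MB used: " + replicas * blockSizeMb); // 3072
    }
}
```

So the 1 GB file occupies 8 blocks logically, but 24 block replicas (3 GB of raw storage) physically.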

All this block information (file name, block names, and the data nodes where they are stored) is kept in RAM on the name node. This information is not stored in the FSImage.

=> Almost. The namespace part of this information (the file name and the IDs of the blocks that make up the file) is persisted through a transaction log called the EditLog. The block-to-datanode locations, however, are not persisted at all: the name node rebuilds them in RAM from the block reports the data nodes send.

What does the FSImage on disk contain (what kind of content is in it)?

=> It contains the complete image of the file system namespace (the directory tree, file-to-block mappings, permissions, and so on) as of the last checkpoint.

For completeness, I should add:

  • fsimage is a file that represents the file system state after all modifications up to a specific transaction ID.
  • EditLog is a log file that lists each file system change (file creation, deletion, or modification) made after the most recent fsimage.
  • Checkpointing is the process of merging the most recent fsimage with all edits applied after it, in order to produce a new fsimage (see the toy sketch after this list). It is triggered automatically by configuration policies or manually by HDFS administration commands.
  • When the NameNode starts, it loads the fsimage into memory and replays the EditLog on top of it, so that it can serve queries quickly.
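
To make the fsimage/EditLog/checkpoint relationship concrete, here is a deliberately simplified toy model in Java. It is not NameNode code; every name in it is hypothetical, and the namespace is reduced to a path-to-block-IDs map:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the fsimage + EditLog checkpoint cycle (not real HDFS code). */
public class ToyNameNode {

    /** One logged namespace change: a file creation or deletion. */
    record Edit(String op, String path, List<Long> blockIds) {}

    // "fsimage": snapshot of the namespace (path -> block IDs) as of the last checkpoint
    private final Map<String, List<Long>> fsimage = new HashMap<>();
    // "EditLog": every change made after that snapshot, in order
    private final List<Edit> editLog = new ArrayList<>();

    void createFile(String path, List<Long> blockIds) {
        editLog.add(new Edit("CREATE", path, blockIds));
    }

    void deleteFile(String path) {
        editLog.add(new Edit("DELETE", path, List.of()));
    }

    /**
     * Checkpoint: replay every logged edit onto the old image to produce
     * the new image, then truncate the log. A NameNode restart performs
     * the same kind of replay to rebuild its in-memory namespace.
     */
    void checkpoint() {
        for (Edit e : editLog) {
            if (e.op().equals("CREATE")) {
                fsimage.put(e.path(), e.blockIds());
            } else {
                fsimage.remove(e.path());
            }
        }
        editLog.clear();
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.createFile("/data/file1", List.of(1001L, 1002L));
        nn.deleteFile("/data/file1");
        nn.createFile("/data/file2", List.of(1003L));
        nn.checkpoint();
        System.out.println(nn.fsimage); // {/data/file2=[1003]}
    }
}
```

Note what the toy deliberately leaves out: block-to-datanode locations. Those never go through the EditLog or the fsimage; they live only in RAM and are rebuilt from block reports.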

When we run a MapReduce job on this dataset, the driver program will split the blocks of data stored on the data nodes into multiple "input splits" (the size is configured in the XML files). In this case, if each input split is 128 MB, then we have 8 input splits, and each split is assigned a mapper process.

=> That's correct. One refinement: the splits are computed by the job's InputFormat at submission time, and they are logical byte ranges over the file, not a physical re-splitting of the blocks on the data nodes.
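
For reference, the split size is not an arbitrary XML value: with the default file-based input format, FileInputFormat (in org.apache.hadoop.mapreduce.lib.input), the split size is the block size clamped between the configured minimum and maximum split sizes. The sketch below reproduces that formula standalone; the config key names in the comments are the real Hadoop 2+ keys, but the class itself is just a demo:

```java
public class SplitSizeDemo {
    // Mirrors FileInputFormat.computeSplitSize(): clamp the block size
    // between the configured minimum and maximum split sizes
    // (mapreduce.input.fileinputformat.split.minsize / .maxsize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB, as in the question
        long minSize = 1;                    // default minimum split size
        long maxSize = Long.MAX_VALUE;       // default maximum split size

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long fileSize = 1024L * 1024 * 1024; // the 1 GB example file

        // With the defaults, split size == block size, so the 1 GB file
        // yields 8 splits and therefore 8 map tasks.
        System.out.println("Split size MB: " + splitSize / (1024 * 1024));
        System.out.println("Splits: " + (fileSize + splitSize - 1) / splitSize);
    }
}
```

This is also why, with default settings, the number of input splits usually equals the number of HDFS blocks, exactly as in your example.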