I want to get some clarification and confirmation of my understanding of blocks and input splits.
Kindly read and let me know if I am correct.
- When a file (say 1 GB in size) is copied from the local file system to HDFS using the "put" command, it is split into blocks based on the block size set in Hadoop's configuration files. With a 128 MB block size, the 1 GB file becomes 8 blocks (1024 MB file / 128 MB block size) stored on 8 different data nodes. Depending on the replication factor (say 3), each block also gets 2 additional copies on other data nodes (I understand the data locality part). All this block information (file name, block names, and the data nodes where they are stored) is held in RAM on the NameNode; this information is not stored in the FSImage. (I have tried to capture this in a small sketch after my questions below.)
Is my understanding correct so far?
If I am correct so far, what does the FSImage on the hard disk contain (what kind of content is in it)?
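To make sure I am describing the same thing, here is a minimal Java sketch of how I picture the copy step. The file paths are placeholders, and the block size / replication values are just my example numbers (I know they would normally come from hdfs-site.xml rather than being set in the driver):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as setting these in hdfs-site.xml; values are just my example numbers
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // 3 copies of each block

        FileSystem fs = FileSystem.get(conf);
        // Equivalent of "hdfs dfs -put"; both paths are placeholders
        fs.copyFromLocalFile(new Path("/local/data/bigfile_1gb.dat"),
                             new Path("/user/nath/bigfile_1gb.dat"));

        // By my math: 1024 MB / 128 MB = 8 blocks, each replicated 3 times = 24 block copies total
        fs.close();
    }
}
```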
- When we run a MapReduce job on this dataset, the driver program divides the blocks of data stored on the data nodes into multiple "input splits" (the split size is configured in the xml files). In this case, if each input split is 128 MB, then we have 8 input splits and each split is assigned a Mapper process (I have sketched a driver after this question to show what I mean).
Is my understanding correct?
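Here is a bare-bones driver sketch of my understanding of the split size part. The paths and job name are placeholders, and I am forcing a 128 MB split size just to illustrate the arithmetic (8 splits, so 8 map tasks); please correct me if this is not how the split size actually relates to the number of mappers:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-example");
        job.setJarByClass(SplitDriver.class);

        // Input/output paths are placeholders
        FileInputFormat.addInputPath(job, new Path("/user/nath/bigfile_1gb.dat"));
        FileOutputFormat.setOutputPath(job, new Path("/user/nath/output"));

        // My understanding: with a 128 MB split size, the 1 GB file gives 8 input splits,
        // and one map task is launched per split.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```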
Thanks much.
Kind regards,
nath