Steps followed to reproduce the scenario
1) Created a file sample.txt with the following content, total size ~153B:
cat sample.txt
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
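A minimal Python sketch to recreate this file and confirm the size (148 characters plus 5 newlines = 153B):

#!/usr/bin/env python
import os

# Recreate sample.txt: 4 short lines plus one long 94-character line.
lines = [
    "This is xyz",
    "This is my home",
    "This is my PC",
    "This is my room",
    "This is ubuntu PC " + "xxxx " * 11 + "x" * 21,
]
with open("sample.txt", "w") as f:
    f.write("\n".join(lines) + "\n")

print(os.path.getsize("sample.txt"))  # 153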
2) Added this property to hdfs-site.xml (dfs.namenode.fs-limits.min-block-size defaults to 1MB; lowering it lets the NameNode accept the tiny 64B block size used below):
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <value>10</value>
</property>
and loaded the file into HDFS with a block size of 64B:

hdfs dfs -Ddfs.bytes-per-checksum=16 -Ddfs.blocksize=64 -put sample.txt /

This created three blocks of sizes 64B, 64B and 25B.
Content in Block0:
This is xyz
This is my home
This is my PC
This is my room
This i
Content in Block1:
s ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx
Content in Block2:
xx xxxxxxxxxxxxxxxxxxxxx
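These block contents can be reproduced by slicing the file at 64B boundaries; a quick sketch:

# Slice sample.txt at the 64B block size to reproduce Block0/1/2.
with open("sample.txt", "rb") as f:
    data = f.read()

block_size = 64
blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
for n, block in enumerate(blocks):
    print("Block%d (%dB): %r" % (n, len(block), block))
# Block0 ends mid-line with 'This i'; Block1 contains no newline at all.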
3) A simple mapper.py:
#!/usr/bin/env python
import sys
# Echo every input line; note that print appends its own newline
# on top of the one already present in line.
for line in sys.stdin:
    print line
4) Ran Hadoop Streaming with 0 reducers:
yarn jar hadoop-streaming-2.7.1.jar -Dmapreduce.job.reduces=0 -file mapper.py -mapper mapper.py -input /sample.txt -output /splittest
The job ran with 3 input splits, invoking 3 mappers, and generated 3 output files: one holding the entire content of sample.txt, the other two 0B files.
hdfs dfs -ls /splittest
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/_SUCCESS
-rw-r--r-- 3 user supergroup 168 2017-03-22 11:13 /splittest/part-00000
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00001
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00002
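As an aside, part-00000 is 168B rather than 153B. This is consistent with two details: print in mapper.py emits an extra newline after every line it echoes (turning 5 records into 10, half of them blank), and the streaming job's default TextOutputFormat writing each record as key + tab + newline when the value is empty. A rough accounting under that assumption:

# Rough accounting for the 168B output, assuming each of the 10
# stdout records is written back as key + "\t" + "\n":
content = 148      # characters in sample.txt, excluding its 5 newlines
records = 5 + 5    # 5 echoed lines + 5 blank lines from print's extra \n
print(content + records * 2)  # 168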
The file sample.txt was divided into 3 splits, which were assigned to the mappers as:

mapper1: start=0, length=64B
mapper2: start=64, length=64B
mapper3: start=128, length=25B
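The same (start, length) pairs can be derived from the file size and the split size; a minimal sketch (FileInputFormat also allows roughly 10% slop on the final split, which does not change the result here):

# Derive the split boundaries for a 153B file with a 64B split size.
file_size, split_size = 153, 64
splits, start = [], 0
while start < file_size:
    length = min(split_size, file_size - start)
    splits.append((start, length))
    start += length
print(splits)  # [(0, 64), (64, 64), (128, 25)]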
A split only determines which portion of the file a mapper should read; it is not an exact boundary for the records it processes. The content a mapper actually reads is determined by the FileInputFormat in use, here TextInputFormat.
TextInputFormat uses LineRecordReader to read the content of each split, with \n as the delimiter (line boundary). For a file that isn't compressed, each mapper reads its lines as explained below.
For the mapper whose start index is 0, reading begins at the start of the split. If the split ends with \n, reading ends at the split boundary; otherwise the reader looks for the first \n past the assigned split length (here 64B), so that it never processes a partial line.
For all other mappers (start index != 0), the reader checks whether the character just before the start index (start - 1) is \n. If it is, it reads from the start of the split; otherwise it skips everything between the start index and the first \n encountered in the split (that content is handled by the previous mapper) and starts reading after that \n.
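In rough pseudo-Python, the two rules look like this (a sketch of the behavior described above, not the actual Hadoop code):

def read_split(data, start, length):
    """Return the lines one mapper would consume for its split (sketch)."""
    end = start + length
    pos = start
    # start != 0: unless the previous byte is '\n', skip the partial
    # line; it belongs to the previous mapper.
    if start != 0 and data[start - 1:start] != b"\n":
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # Keep reading whole lines; the last one may run past the split end.
    while pos < end:
        nl = data.find(b"\n", pos)
        nl = len(data) - 1 if nl == -1 else nl
        lines.append(data[pos:nl + 1])
        pos = nl + 1
    return lines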
Here, mapper1 (start index 0) begins at Block0, whose split ends in the middle of a line. It therefore keeps reading that line, which consumes the whole of Block1; and since Block1 contains no \n character, mapper1 continues until it finds one, which consumes the whole of Block2 as well. That is how the entire content of sample.txt ended up in a single mapper's output.
For mapper2 (start index != 0), the character preceding its start index is not a \n, so it skips ahead to the first \n; its entire split sits inside one long line, so nothing is left to read and the mapper output is empty. mapper3 hits the identical scenario.
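Feeding the three splits through the read_split sketch above reproduces exactly what the job showed:

with open("sample.txt", "rb") as f:
    data = f.read()

for n, (start, length) in enumerate([(0, 64), (64, 64), (128, 25)]):
    lines = read_split(data, start, length)
    print("mapper%d read %d lines, %d bytes"
          % (n + 1, len(lines), sum(len(l) for l in lines)))
# mapper1 read 5 lines, 153 bytes
# mapper2 read 0 lines, 0 bytes
# mapper3 read 0 lines, 0 bytes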