I know that tab is the default field separator for the following properties:
stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator
But when I set the generic parser option:
stream.map.output.field.separator=\t (or)
stream.map.output.field.separator="\t"
to test how Hadoop parses whitespace characters such as \t, \n, and \f when they are used as separators, I observe that Hadoop reads the value as the literal characters \t rather than as an actual tab. I checked this by printing each line in the reducer (Python) as it is read, using:
sys.stdout.write(str(line))
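Since tabs and trailing whitespace are hard to see in plain output, a more explicit check would be to write repr(line) instead of str(line). Below is a minimal sketch of that idea; the helper name visible_lines and the in-memory sample are hypothetical and only stand in for my reducer reading sys.stdin.

```python
import sys

def visible_lines(stream):
    """Yield each input line with whitespace shown as escape sequences."""
    for line in stream:
        # repr() renders a real tab as \t and a newline as \n,
        # so a trailing separator is no longer invisible.
        yield repr(line)

# In-memory sample standing in for the reducer's stdin (assumed input).
sample = ["key\tvalue1\tvalue2\t\n"]
for shown in visible_lines(sample):
    sys.stdout.write(shown + "\n")
# prints: 'key\tvalue1\tvalue2\t\n'
```

In the actual reducer, sys.stdin would be passed instead of sample. With repr, a real tab shows up as \t, while a literal backslash followed by t would show up as \\t, so the two cases can be told apart.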
My mapper emits key/value pairs as: key value1 value2
using print(key, value1, value2, sep='\t', end='\n').
So I expected my reducer to read each line as: key value1 value2
as well, but instead sys.stdout.write(str(line))
printed:
key value1 value2 (with trailing whitespace)
From the question "Hadoop streaming - remove trailing tab from reducer output", I understood that the trailing whitespace appears because mapreduce.textoutputformat.separator
is not set and is left at its default.
So this confirmed my assumption that Hadoop treated my entire map output:
key value1 value2
as the key, with an empty Text object as the value, because it read the separator from stream.map.output.field.separator=\t
as the literal characters "\t" instead of an actual tab character.
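To make my assumption concrete, here is a rough Python simulation of what I think is happening. This is not Hadoop's actual code: split_key_value and reducer_line are hypothetical helpers, and the rule "the whole line becomes the key when the separator is not found" is my understanding of the streaming behavior.

```python
def split_key_value(line, separator):
    """Split at the first separator; if it is absent, the whole line is the key."""
    key, _, value = line.partition(separator)
    return key, value

def reducer_line(key, value, separator="\t"):
    """Join key and value the way I assume the reducer input is built (tab by default)."""
    return key + separator + value

map_output = "key\tvalue1\tvalue2"  # what my mapper prints, with real tabs

# Case 1: the separator is a real tab character.
real_tab = split_key_value(map_output, "\t")
# real_tab == ("key", "value1\tvalue2")

# Case 2: the separator is the two literal characters backslash and t.
literal = split_key_value(map_output, "\\t")
# literal == ("key\tvalue1\tvalue2", "")  -> empty value

# Joining the empty value back on produces the trailing tab I observed.
print(repr(reducer_line(*literal)))
```

If this model is right, the trailing whitespace comes from the separator being appended between the whole-line key and the empty value.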
Please help me understand this behavior, and how I can use \t (an actual tab) as a separator if I want to.