hadoop 2.4.0 streaming generic parser options using TAB as separator

Question

I know that tab is the default field separator for the following properties:

stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator

But if I try to set the generic option explicitly:

stream.map.output.field.separator=\t (or)  
stream.map.output.field.separator="\t"

to test how Hadoop parses whitespace characters such as \t, \n and \f when they are used as separators. I observed that Hadoop treats the value as the literal characters \t rather than as an actual tab character. I checked this by printing each line in the (Python) reducer as it is read, using:

sys.stdout.write(str(line))
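
Writing the line back as-is makes a real tab and ordinary spacing hard to tell apart. A minimal sketch of a more explicit check follows. The helper names describe_line and dump are hypothetical, and the sample lines stand in for what a reducer would read from sys.stdin:

```python
import sys


def describe_line(line):
    """Return a debug string that makes the separators in `line` visible."""
    real_tabs = line.count("\t")      # actual TAB characters (0x09)
    literal_tabs = line.count("\\t")  # a backslash followed by the letter t
    return "{!r} real_tabs={} literal_backslash_t={}".format(
        line, real_tabs, literal_tabs)


def dump(lines, out=sys.stderr):
    """Write one description per input line to `out`."""
    for line in lines:
        out.write(describe_line(line) + "\n")


# In a reducer this would be dump(sys.stdin); sample lines are used here.
dump(["key\tvalue1\tvalue2\t\n", "key\\tvalue1\n"])
```

repr() shows a tab as \t and a literal backslash as \\, so the two cases can no longer be confused. Writing to stderr keeps the diagnostics out of the reducer's own output.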

My mapper emits key/value pairs as: key value1 value2

using the call print(key, value1, value2, sep='\t', end='\n').

So I expected my reducer to read each line as key value1 value2 as well, but instead sys.stdout.write(str(line)) printed:

key value1 value2 \\with trailing space

From the question "Hadoop streaming - remove trailing tab from reducer output", I understood that the trailing space appears because mapreduce.textoutputformat.separator is not set and is left at its default.

So this confirmed my assumption that Hadoop treated my entire map output:

key value1 value2

as the key, with an empty Text object as the value, because it read the separator from stream.map.output.field.separator=\t as the literal characters "\t" instead of an actual tab.
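
The difference between the two interpretations can be sketched in plain Python. This is only an emulation of splitting on the first separator, not Hadoop's own code, and split_on_first is a hypothetical helper:

```python
def split_on_first(line, separator):
    """Split `line` into (key, value) at the first `separator`.

    If the separator does not occur, the whole line becomes the key
    and the value is the empty string.
    """
    key, _found, value = line.partition(separator)
    return key, value


map_output = "key\tvalue1\tvalue2"  # what the mapper printed (real tabs)

print(split_on_first(map_output, "\t"))   # ('key', 'value1\tvalue2')
print(split_on_first(map_output, "\\t"))  # ('key\tvalue1\tvalue2', '')
```

In the second call the separator is the two characters backslash and t, which never occur in the mapper output, so the whole line ends up as the key with an empty value. An empty value followed by the default output separator would be consistent with the trailing whitespace seen in the reducer.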

Please help me understand this behavior, and how I can use \t as a separator if I want to.

Ramzy Ramzy · Accepted Answer · 2015-06-04T18:53:40

You might be running into the behavior described in the Hadoop streaming documentation: "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has less than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")). The "fourth" comes from -D stream.num.map.output.key.fields=4, which is set in the same documentation example.

This states clearly how the separator is used, and how many occurrences of it are considered when identifying the map key and value. There are also settings related to partitioning, which decide which reducer handles each key. Since you want to change the separator, I think you should also check the partitioning and reducer settings that depend on it.
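
The rule quoted from the documentation can be emulated with a short sketch. This is an illustration of the documented behavior, not Hadoop's implementation. split_key_value is a hypothetical helper, and num_key_fields stands in for the number of key fields (four in the documentation example):

```python
def split_key_value(line, separator, num_key_fields=1):
    """Emulate the documented split of a map output line.

    The key is the prefix up to the `num_key_fields`-th separator and the
    value is the rest of the line, excluding that separator. If the line
    has fewer separators than that, the whole line is the key and the
    value is empty.
    """
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Fewer than `num_key_fields` separators in the line.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


print(split_key_value("a.b.c.d.e.f", ".", 4))  # ('a.b.c.d', 'e.f')
print(split_key_value("a.b.c", ".", 4))        # ('a.b.c', '')
```

The second call shows the case from the question: when the configured separator occurs too few times (or not at all), the whole line becomes the key and the value is empty.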
