In case it's helpful to others: using the tips above, I was able to write an implementation:
CustomOutputFormat<K, V> extends org.apache.hadoop.mapred.TextOutputFormat<K, V> {....}
with exactly one line of the built-in implementation of 'getRecordWriter' changed to:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "");
instead of:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
After compiling that into a jar and including it in my Hadoop streaming call (per the instructions for Hadoop streaming), the call looked like:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-archives 'hdfs:///user/the/path/to/your/jar/onHDFS/theNameOfTheJar.jar' \
-libjars theNameOfTheJar.jar \
-outputformat com.yourcompanyHere.package.path.tojavafile.CustomOutputFormat \
-file yourMapper.py -mapper yourMapper.py \
-file yourReducer.py -reducer yourReducer.py \
-input $yourInputFile \
-output $yourOutputDirectoryOnHDFS
I also included the jar in the folder I issued that call from.
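If you need a starting point for the compile-and-package step, here's a rough sketch (the hadoop-core jar path, class directory, and jar name are illustrative for a Hadoop 1.0.3 install under /usr/lib/hadoop -- adjust for your setup):

# sketch only: adjust paths/names for your install; the .java file should
# declare the package you pass to -outputformat above
mkdir -p classes
javac -classpath /usr/lib/hadoop/hadoop-core-1.0.3.jar -d classes CustomOutputFormat.java
jar cf theNameOfTheJar.jar -C classes .
# put a copy on HDFS so the -archives option can find it
hadoop fs -put theNameOfTheJar.jar /user/the/path/to/your/jar/onHDFS/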
It worked great for my needs (and it created no tabs at the end of the lines after the reducer).
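A quick sanity check for trailing tabs (the part file name is illustrative -- use whatever part files your job produced): cat -A prints tabs as ^I and line ends as $, so clean output shows no ^I right before the $.

hadoop fs -cat $yourOutputDirectoryOnHDFS/part-00000 | cat -A | head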
Update: based on a comment indicating this is indeed helpful to others, here's the full source of my CustomOutputFormat.java file:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.util.ReflectionUtils;
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name,
            Progressable progress) throws IOException {
        boolean isCompressed = getCompressOutput(job);
        // Changing the default separator from '\t' to blank
        String keyValueSeparator = job.get("mapred.textoutputformat.separator", ""); // was "\t"
        if (!isCompressed) {
            Path file = FileOutputFormat.getTaskOutputPath(job, name);
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
            Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job,
                GzipCodec.class);
            // create the named codec
            CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
            // build the filename including the extension
            Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension());
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(new DataOutputStream(
                codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
FYI: for your usage context, be sure to check that this does not adversely affect the hadoop-streaming-managed interactions (i.e., how the key is separated from the value) between your mapper and reducer. To clarify:
From my testing -- if you have a tab in every line of your data (with something on each side of it), you can leave the built-in defaults as they are: streaming will interpret everything before the first tab as your 'key' and everything after it on that row as your 'value.' As such, it does not see a null value, and it won't append a tab after your reducer. (You'll see your final outputs sorted on the 'key' that streaming interprets in each row, i.e., whatever occurs before the first tab.)
Conversely, if you have no tabs in your data and you don't override the defaults using the trick above, then you'll see the trailing tabs after the run completes, which is what the override above fixes.
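If you want to see that default behavior for yourself, here's a minimal sketch (file names and paths are illustrative) that pushes a tab-free file through an identity streaming job and then looks for the appended tabs:

# sketch: identity mapper/reducer over input that contains no tabs
printf 'lineone\nlinetwo\n' | hadoop fs -put - /tmp/notabs.txt
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -mapper cat -reducer cat \
    -input /tmp/notabs.txt -output /tmp/notabs-out
# with the built-in defaults, each line should end in a tab (shown as ^I)
hadoop fs -cat /tmp/notabs-out/part-* | cat -A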
(Comment from Eddified: "... a|b|c, I output a\tb\tc and then just tell AWS Redshift to delimit on \t instead of on |.")