In case it's helpful to others: using the tips above, I was able to write an implementation:
CustomOutputFormat<K, V> extends org.apache.hadoop.mapred.TextOutputFormat<K, V> {....}
with exactly one line of the built-in implementation of 'getRecordWriter' changed to:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "");
instead of:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
After compiling that into a jar and including it in my Hadoop streaming call (per the instructions for Hadoop streaming), the call looked like:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-archives 'hdfs:///user/the/path/to/your/jar/onHDFS/theNameOfTheJar.jar' \
-libjars theNameOfTheJar.jar \
-outputformat com.yourcompanyHere.package.path.tojavafile.CustomOutputFormat \
-file yourMapper.py -mapper yourMapper.py \
-file yourReducer.py -reducer yourReducer.py \
-input $yourInputFile \
-output $yourOutputDirectoryOnHDFS
I also included the jar in the folder I issued that call from.
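If you need a starting point for the compile-and-package step, here's a rough sketch (the hadoop-core jar path, class directory, and jar name are illustrative for a Hadoop 1.0.3 install under /usr/lib/hadoop -- adjust for your setup):

# sketch only: adjust paths/names for your install; the .java file should
# declare the package you pass to -outputformat above
mkdir -p classes
javac -classpath /usr/lib/hadoop/hadoop-core-1.0.3.jar -d classes CustomOutputFormat.java
jar cf theNameOfTheJar.jar -C classes .
# put a copy on HDFS so the -archives option can find it
hadoop fs -put theNameOfTheJar.jar /user/the/path/to/your/jar/onHDFS/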
It worked great for my needs (and it created no tabs at the end of the lines after the reducer).
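A quick sanity check for trailing tabs (the part file name is illustrative -- use whatever part files your job produced): cat -A prints tabs as ^I and line ends as $, so clean output shows no ^I right before the $.

hadoop fs -cat $yourOutputDirectoryOnHDFS/part-00000 | cat -A | head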
Update: based on a comment indicating this is indeed helpful to others, here's the full source of my CustomOutputFormat.java file:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.util.ReflectionUtils;
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name,
            Progressable progress) throws IOException {
        boolean isCompressed = getCompressOutput(job);
        // Changing the default separator from '\t' to blank
        String keyValueSeparator = job.get("mapred.textoutputformat.separator", ""); // was "\t"
        if (!isCompressed) {
            Path file = FileOutputFormat.getTaskOutputPath(job, name);
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
            Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job,
                GzipCodec.class);
            // create the named codec
            CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
            // build the filename including the extension
            Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension());
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(new DataOutputStream(
                codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
FYI: for your usage context, be sure to check that this does not adversely affect the hadoop-streaming-managed interactions (i.e., how the key is separated from the value) between your mapper and reducer. To clarify:
From my testing -- if you have a tab in every line of your data (with something on each side of it), you can leave the built-in defaults as they are: streaming will interpret everything before the first tab as your 'key' and everything after it on that row as your 'value.' As such, it does not see a null value, and it won't append a tab after your reducer. (You'll see your final outputs sorted on the 'key' that streaming interprets in each row, i.e., whatever occurs before the first tab.)
Conversely, if you have no tabs in your data and you don't override the defaults using the trick above, then you'll see the trailing tabs after the run completes, which is what the override above fixes.
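If you want to see that default behavior for yourself, here's a minimal sketch (file names and paths are illustrative) that pushes a tab-free file through an identity streaming job and then looks for the appended tabs:

# sketch: identity mapper/reducer over input that contains no tabs
printf 'lineone\nlinetwo\n' | hadoop fs -put - /tmp/notabs.txt
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
    -mapper cat -reducer cat \
    -input /tmp/notabs.txt -output /tmp/notabs-out
# with the built-in defaults, each line should end in a tab (shown as ^I)
hadoop fs -cat /tmp/notabs-out/part-* | cat -A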
(Comment from Eddified: "... a|b|c, I output a\tb\tc and then just tell AWS Redshift to delimit on \t instead of on |.")