As it stands, MultipleTextOutputFormat has not been migrated to the new API. So if we need to choose an output directory and output filename on the fly, based on the key-value pair being written, what alternative do we have with the new mapreduce API?
3 Answers
I'm using AWS EMR Hadoop 1.0.3, and it is possible to specify different directories and files based on k/v pairs. Use either of the following functions from the MultipleOutputs class:
public void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
or
public <K,V> void write(String namedOutput, K key, V value, String baseOutputPath)
The former write method requires the key to be the same type as the map output key (if you are using it in the mapper) or the same type as the reduce output key (if you are using it in the reducer). The value must be typed in the same way.
The latter write method requires the key/value types to match the types specified when you set up the MultipleOutputs static properties using the addNamedOutput function:
public static void addNamedOutput(Job job,
String namedOutput,
Class<? extends OutputFormat> outputFormatClass,
Class<?> keyClass,
Class<?> valueClass)
So if you need different output types than the Context is using, you must use the latter write method.
The trick to getting different output directories is to pass a baseOutputPath that contains a directory separator, like this:
multipleOutputs.write("output1", key, value, "dir1/part");
In my case, this created files named "dir1/part-r-00000".
I was not successful in using a baseOutputPath that contains the .. directory, so all baseOutputPaths are strictly contained in the path passed to the -output parameter.
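To illustrate picking the directory from the key, here is a minimal sketch of a helper that turns a key into a baseOutputPath. BaseOutputPaths and forKey are hypothetical names of my own, not part of Hadoop, and the sanitizing rule is an assumption; it also rejects "." and ".." since the .. case did not work for me above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Hadoop) that derives a baseOutputPath from a record key. */
final class BaseOutputPaths {
    // Anything other than letters, digits, '.', '_' and '-' is replaced with '_',
    // so a key can never introduce an extra directory separator.
    private static final Pattern UNSAFE = Pattern.compile("[^A-Za-z0-9._-]");

    private BaseOutputPaths() {}

    /** Returns "<sanitized key>/part", e.g. "dir1/part" for the key "dir1". */
    static String forKey(String key) {
        String dir = UNSAFE.matcher(key).replaceAll("_");
        // Reject names that would point at the output directory itself or its parent.
        if (dir.isEmpty() || dir.equals(".") || dir.equals("..")) {
            throw new IllegalArgumentException("Key cannot be used as a directory name: " + key);
        }
        return dir + "/part";
    }
}
```

In a reducer the result would be passed as the last argument, something like multipleOutputs.write("output1", key, value, BaseOutputPaths.forKey(key.toString())), which for the key "dir1" should give files named like the "dir1/part-r-00000" mentioned above.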
For more details on how to set up and properly use MultipleOutputs, see this code I found (not mine, but I found it very helpful; it does not use different output directories): https://github.com/rystsov/learning-hadoop/blob/master/src/main/java/com/twitter/rystsov/mr/MultipulOutputExample.java
Similar to: Hadoop Reducer: How can I output to multiple directories using speculative execution?
Basically you can write to HDFS directly from your reducer. You'll just need to be wary of speculative execution and name your files uniquely, and then you'll need to implement your own OutputCommitter to clean up the aborted attempts (this is the most difficult part if you have truly dynamic output folders: you'll need to step through each folder and delete the attempts associated with aborted or failed tasks). A simple solution to this is to turn off speculative execution.
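As a rough sketch of the naming and cleanup bookkeeping only (no HDFS calls), the idea could look like the following. AttemptFiles, fileNameFor and filesToDelete are hypothetical names, and the "<base>-<attempt id>" naming scheme is my assumption; in a real job the attempt id string would come from something like context.getTaskAttemptID().toString(), and the deletion itself would live in your OutputCommitter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: unique per-attempt file names and selection of stale files. */
final class AttemptFiles {
    private AttemptFiles() {}

    /** Appends the task attempt id so that speculative attempts never write the same file. */
    static String fileNameFor(String baseName, String attemptId) {
        return baseName + "-" + attemptId;
    }

    /**
     * Returns the paths whose attempt id is not among the committed attempts.
     * Files that do not follow the "<base>-attempt_..." scheme are left alone.
     */
    static List<String> filesToDelete(List<String> paths, Set<String> committedAttemptIds) {
        List<String> stale = new ArrayList<>();
        for (String path : paths) {
            String name = path.substring(path.lastIndexOf('/') + 1);
            int sep = name.indexOf("-attempt_");
            if (sep < 0) {
                continue; // not written by this naming scheme
            }
            String attemptId = name.substring(sep + 1);
            if (!committedAttemptIds.contains(attemptId)) {
                stale.add(path);
            }
        }
        return stale;
    }
}
```

With truly dynamic output folders you would still have to list every folder to build the paths argument, which is why turning off speculative execution is the simpler route.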
For the best answer, turn to Hadoop: The Definitive Guide, 3rd Ed. (starting on pg. 253).
An excerpt from the book:
"In the old MapReduce API, there are two classes for producing multiple outputs: MultipleOutputFormat and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has more control over the output directory structure and file naming. MultipleOutputs in the new API combines the best features of the two multiple output classes in the old API."
It has an example of how you can control directory structure, file naming, and output format using the MultipleOutputs API.
HTH.