I'm new to Hadoop, but this has been a learning project of mine for the last month.
To keep this general enough to be useful to others, let me state the basic goal first. Assume:
- You have a large data set (obviously) of millions of basic ASCII text files.
- Each file is a "record."
- The records are stored in a directory structure that identifies the customer and date
- e.g. /user/hduser/data/customer1/YYYY-MM-DD, /user/hduser/data/customer2/YYYY-MM-DD
- You want the output directory structure to mirror the input directory structure
- e.g. /user/hduser/out/customer1/YYYY-MM-DD, /user/hduser/out/customer2/YYYY-MM-DD
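To make the goal concrete, here is a minimal plain-Java sketch of the path mapping I am after. It does not touch Hadoop at all; the class name `MirrorPath`, the helper `mirror`, and the sample date are hypothetical and only illustrate the intended input-to-output correspondence.

```java
public class MirrorPath {
    // Hypothetical helper: replaces the input root at the front of a path
    // with the output root, keeping the customer/date part unchanged.
    static String mirror(String inputPath, String inputRoot, String outputRoot) {
        if (!inputPath.startsWith(inputRoot + "/")) {
            throw new IllegalArgumentException(inputPath + " is not under " + inputRoot);
        }
        return outputRoot + inputPath.substring(inputRoot.length());
    }

    public static void main(String[] args) {
        // prints /user/hduser/out/customer1/2013-01-15
        System.out.println(mirror("/user/hduser/data/customer1/2013-01-15",
                "/user/hduser/data", "/user/hduser/out"));
    }
}
```

Whatever the job ends up doing, every input directory should map to an output directory in this way.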
I have looked at multiple threads:
- Multiple output path java hadoop mapreduce
- MultipleTextOutputFormat alternative in new api
- Separate Output files in Hadoop mapreduce
- Speculative Task Execution -- To try and solve the -m-part#### issue
And many more. I've also been reading Tom White's Hadoop book. I've been eager to learn this, but I have frequently switched between the new API and the old API, which adds to the confusion.
Many have pointed to MultipleOutputs (or the old API versions), but I seem to be unable to produce my desired output. For instance, MultipleOutputs doesn't seem to accept a "/" to create a directory structure in write().
What steps need to be taken to create a file with the desired output structure? Currently I have a WholeFileInputFormat class and a related RecordReader that produces a (NullWritable K, BytesWritable V) pair (which can change if needed).
My map setup:
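    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class MapClass extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
        private Text filenameKey;
        private MultipleOutputs<Text, BytesWritable> mos; // type parameters match the mapper's output types
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            InputSplit split = context.getInputSplit();
            Path path = ((FileSplit) split).getPath();
            filenameKey = new Text(path.toString().substring(38)); // bad hack until I figure out a better way: strips the leading hdfs://master:port/user/hduser/path/
            mos = new MultipleOutputs<>(context);
        }
    }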
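As an aside on the substring(38) hack: one less brittle option might be to parse the path as a URI and strip a known input root instead of a fixed character count. The sketch below is plain Java and only an illustration; the class name `RelativeKey`, the helper `relativeTo`, the port 54310, and the file name `record.txt` are all hypothetical, and it assumes the path contains no characters that need URI escaping.

```java
import java.net.URI;

public class RelativeKey {
    // Hypothetical helper: drops the scheme, host, and port, then the input
    // root, leaving only the customer/date/file part of the path.
    static String relativeTo(String fullUri, String inputRoot) {
        String path = URI.create(fullUri).getPath(); // e.g. "/user/hduser/data/..."
        String prefix = inputRoot.endsWith("/") ? inputRoot : inputRoot + "/";
        if (!path.startsWith(prefix)) {
            throw new IllegalArgumentException(path + " is not under " + inputRoot);
        }
        return path.substring(prefix.length());
    }

    public static void main(String[] args) {
        // prints customer1/2013-01-15/record.txt
        System.out.println(relativeTo(
                "hdfs://master:54310/user/hduser/data/customer1/2013-01-15/record.txt",
                "/user/hduser/data"));
    }
}
```

Inside the mapper, I believe path.toUri().getPath() would give the same scheme-less string to start from, so the host name and port length would no longer matter.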
There is also a cleanup() method that calls mos.close(). The map() method is currently the unknown, and it is what I need help with.
Is this enough information to point a newbie in the direction of an answer? My next thought was to create a new MultipleOutputs object in every map() call, each with a new baseOutputPath string, but I'm unsure whether that is efficient or even the right approach.
Advice would be appreciated. Anything in the program can change at this point except for the input, since I'm just trying to learn the framework, but I would like to get as close to this result as possible. (Later on I will probably look at combining records into larger files, but they are already 20MB per record, and I want to make sure it works before I make the output impossible to read in Notepad.)
Edit: Could this problem be solved by modifying/extending the TextOutputFormat.class? It seems it might have some of the methods that could work, but I'm unsure which methods I'd need to override...