I have an MR job running on EMR that currently stores its output in S3. The reducer's output becomes the input to the same mapper (think identity mapper), and I would like to kick off each successive run as fast as possible, instead of waiting for EMR to write to S3 and then scheduling the mapper after 'x' minutes to read the data back. Writing to and reading from S3 takes significant time (~3-5 minutes), so is there a way to avoid reading from S3 on my successive runs?
I also need to write the output of each MapReduce job to S3, because that data is important to me and must be persisted. However, for each successive MR run I do not want to read from S3; can I instead write the output to HDFS (or a cache) and use that as the input for the next run?
MultipleOutputs helps with writing data to multiple files in a folder, or to multiple folders. See: Writing output to different folders hadoop
How can I extend this concept to write to two different endpoints, S3 and HDFS?
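To make the idea concrete, here is a minimal driver sketch of what I have in mind (all paths, the bucket name, and the job names are placeholders): each iteration writes its output to HDFS so the next iteration can read it locally, and the HDFS output is then copied to S3 with `FileUtil.copy` purely for persistence, so no run ever has to read its input from S3.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int iterations = 5;                               // placeholder
        Path input = new Path("hdfs:///iter/input");      // placeholder seed data

        for (int i = 0; i < iterations; i++) {
            Path hdfsOut = new Path("hdfs:///iter/output-" + i);

            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // job.setMapperClass(...); job.setReducerClass(...); etc.
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, hdfsOut);
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // Persist this iteration's output to S3, without making the
            // next iteration's input depend on an S3 read.
            FileSystem hdfs = hdfsOut.getFileSystem(conf);
            FileSystem s3 = FileSystem.get(URI.create("s3://my-bucket/"), conf); // placeholder bucket
            FileUtil.copy(hdfs, hdfsOut,
                    s3, new Path("s3://my-bucket/iter/output-" + i),
                    false /* do not delete the HDFS source */, conf);

            // The next run reads directly from HDFS, not S3.
            input = hdfsOut;
        }
    }
}
```

On EMR, `s3-dist-cp` (run as a step) would presumably do the HDFS-to-S3 copy more efficiently than `FileUtil.copy`, and could even proceed while the next iteration starts, but I am not sure whether that is the recommended pattern here.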