I am trying to reduce the latency of a MapReduce job in my data stream, and I want to continuously tail the reducer's part-xxxx output files using the HDFS API instead of reading them after the job completes. But I am wondering: is this safe for Hadoop jobs?
1 Answer
When you use FileOutputFormat-based output formats (Text, SequenceFile, etc.), they rely on a common FileOutputCommitter, which is responsible for committing or aborting a reducer's output when it succeeds or fails.
Behind the scenes, when your reducer is writing output, it is written to a _temporary subdirectory of your designated HDFS output directory.
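For illustration, while a reduce attempt is still running the layout on HDFS looks roughly like this (the output path here is made up, and the exact _temporary structure varies between Hadoop versions):

```
/user/me/job-output/_temporary/_attempt_201201011200_0001_r_000003_0/part-r-00003   <- still being written
/user/me/job-output/part-r-00000                                                    <- already committed
```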
When a reducer completes, the JobTracker designates one specific attempt of that reducer task as the final output (remember that with speculative execution a reduce task may have one or more attempts running) and signals the output committer to commit that attempt's output; the other attempts are aborted.
When the output committer commits an attempt's output, it merely moves the part-r-xxxxx file from the attempt's temporary directory up into the designated output directory.
So with this in mind, when you see part-r-* files in your output directory, they are fully written and safe to tail. In that sense you can get a jump on processing your reducer output (say you have 10K reducers running on a cluster with 1,000 reduce slots). You cannot, however, schedule another map/reduce job to process this output yet: a job only considers the input files that exist at submission time, so only the reducer outputs already committed by then would be picked up; it will not keep scanning for new files that appear after submission.
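If you do want that head start, a minimal sketch along these lines should work; the output path and polling interval are made up for illustration, and it only ever reads files the committer has already moved out of _temporary:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommittedPartWatcher {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path outputDir = new Path("/user/me/job-output");   // assumed output path
        Set<String> seen = new HashSet<String>();

        while (true) {
            // Only files the committer has already moved out of _temporary
            // appear directly under the output directory.
            FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-r-*"));
            if (parts != null) {
                for (FileStatus part : parts) {
                    if (seen.add(part.getPath().getName())) {
                        process(fs, part.getPath());
                    }
                }
            }
            Thread.sleep(10 * 1000L);   // arbitrary polling interval
        }
    }

    private static void process(FileSystem fs, Path path) throws Exception {
        // Committed part files are complete, so a plain sequential read is safe.
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // handle the record
            }
        } finally {
            reader.close();
        }
    }
}
```

In practice you would break out of the polling loop once the job reports completion (e.g. by checking Job.isComplete() from the submitting client).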
You should also consider that your job may fail in the last few reducers. In that case, do you still want to have eagerly processed the outputs of the reducers that completed before the failure, or do you only want to process anything if the entire job completes (which makes more sense for most jobs)?
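If you decide to wait for the whole job, one simple check (assuming your Hadoop version and configuration write the success marker, which recent releases do by default) is to look for the _SUCCESS file that FileOutputCommitter drops into the output directory during job commit:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessMarkerCheck {
    // Returns true only after the whole job has committed, i.e. no reducer failed.
    public static boolean jobCommitted(FileSystem fs, Path outputDir) throws Exception {
        return fs.exists(new Path(outputDir, "_SUCCESS"));
    }
}
```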