
I am new to Hadoop and am writing an iterative MapReduce job.

I know that Hadoop splits a large dataset into smaller input splits and sends each one as input to the map function on a different machine.

I have managed to append the result of one MapReduce pass to the end of the output file, but that way the result of an iterative job ends up on only one machine.

So I want to append the result to EACH input split sent to each machine, so that every machine can see the previous result.

How can I do it?


1 Answer


In your map method you could append the output to one common HDFS file instead of writing to the context object, but if multiple map tasks try to append to the same file you will get errors.

Workaround:

  1. After each iteration of the MR job, append the output to a temp file in the tmp directory.
  2. Move this temp file to HDFS (using the Java Hadoop FileSystem API).
  3. In the next iteration, add this temp file, now in HDFS, to the distributed cache.
  4. Read the distributed-cache file from each map task.
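The steps above can be sketched without a cluster. In this Hadoop-free simulation a local file stands in for the HDFS copy that would be loaded into the distributed cache; the class, method, and file names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hadoop-free sketch of the carry-forward pattern: a local file stands in
// for the HDFS file / distributed-cache copy. All names are hypothetical.
public class IterativeCarryForward {

    // Step 4 analogue: each "map task" first reads every previous result.
    static List<String> readPreviousResults(Path cacheFile) throws IOException {
        if (!Files.exists(cacheFile)) {
            return List.of(); // first iteration: nothing to carry forward yet
        }
        return Files.readAllLines(cacheFile);
    }

    // Step 1 analogue: after an iteration, append that iteration's result.
    static void appendResult(Path cacheFile, String result) throws IOException {
        Files.writeString(cacheFile, result + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path cacheFile = Files.createTempFile("iteration-results", ".txt");
        Files.delete(cacheFile); // start with no file, as in the first iteration

        for (int i = 1; i <= 3; i++) {
            // Every task in iteration i can see results of iterations 1..i-1.
            List<String> previous = readPreviousResults(cacheFile);
            System.out.println("iteration " + i + " sees " + previous);
            appendResult(cacheFile, "result-of-iteration-" + i);
        }
        Files.delete(cacheFile);
    }
}
```

In a real job, the `appendResult` step happens on the local file, the file is then copied into HDFS, and `readPreviousResults` corresponds to reading the distributed-cache copy in the mapper's setup.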

Please let me know if you need further help.

Temp file update logic

 // Appends data to the local temp file, creating it if necessary.
 public void appendTempData(String tempFile, String data)
 {
   try
   {
     File temp = new File(tempFile);
     if (!temp.exists())
     {
       temp.createNewFile();
     }
     // Pass the File itself, not temp.getName(): getName() drops the
     // directory part of the path. The second argument enables append mode.
     FileWriter fw = new FileWriter(temp, true);
     BufferedWriter bw = new BufferedWriter(fw);
     bw.write(data);
     bw.close();
   }
   catch (IOException e)
   {
     e.printStackTrace();
   }
 }

Call this method, then move the temp file to HDFS and add it to the distributed cache.
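Here is a minimal, self-contained sketch of calling that append helper before the HDFS move; the class name, method name, and path are hypothetical, and the try-with-resources block guarantees the writer is closed even on failure:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical driver showing the append helper in use between iterations.
public class TempFileDemo {

    // Same logic as the helper above: open in append mode so successive
    // iterations accumulate results instead of overwriting them.
    static void appendTempData(String tempFile, String data) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(tempFile, true))) {
            bw.write(data);
            bw.newLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String tempFile = "/tmp/mr-iteration-output.txt"; // hypothetical path
        appendTempData(tempFile, "output of iteration 1");
        appendTempData(tempFile, "output of iteration 2");
        // Next: copy tempFile into HDFS and register it with the
        // distributed cache before submitting the following job.
    }
}
```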