3 votes

I've written a simple k-means clustering program for Hadoop (two separate programs: a mapper and a reducer). The code works on a small dataset of 2D points on my local box. It's written in Python, and I plan to use the Streaming API.
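For concreteness, here is a minimal sketch of the standard k-means map/reduce split, close to what I have (the centres file name and the "x,y" input format are just what I happen to use):

```python
#!/usr/bin/env python
# mapper.py -- emit "<index of nearest centre>\t<x,y>" for each input point.
import sys

# The current centres are shipped to every task with the streaming -file
# option; "centres.txt" is just an example name, one "x,y" pair per line.
centres = [tuple(map(float, l.split(',')))
           for l in open('centres.txt') if l.strip()]

for line in sys.stdin:
    if not line.strip():
        continue
    x, y = map(float, line.split(','))
    nearest = min(range(len(centres)),
                  key=lambda i: (x - centres[i][0]) ** 2
                              + (y - centres[i][1]) ** 2)
    print('%d\t%s,%s' % (nearest, x, y))
```

```python
#!/usr/bin/env python
# reducer.py -- average the points assigned to each centre and emit one
# new centre per line, as "x,y". A dict is used, so this does not depend
# on the keys arriving grouped.
import sys

sums = {}  # centre index -> [sum_x, sum_y, count]
for line in sys.stdin:
    key, value = line.strip().split('\t')
    x, y = map(float, value.split(','))
    s = sums.setdefault(key, [0.0, 0.0, 0])
    s[0] += x
    s[1] += y
    s[2] += 1

for key in sums:
    sx, sy, n = sums[key]
    print('%s,%s' % (sx / n, sy / n))
```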

I would like suggestions on how best to run this program on Hadoop.

After each run of the mapper and reducer, new centres are generated; these centres are the input for the next iteration.

From what I can see, each iteration will have to be a separate MapReduce job. It also looks like I'll have to write another script (Python/Bash) to extract the new centres from HDFS after each reduce phase and feed them back to the mapper.

Is there an easier, less messy way to do this? And if the cluster happens to use a fair scheduler, will it take very long for this computation to complete?


4 Answers

1 vote

You needn't write another job. You can put the same job in a loop (a while loop) and just keep changing its parameters: when the mapper and reducer complete, control returns to the loop, you create a new configuration, and the input is automatically the output of the previous phase.
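A minimal sketch of that loop as a Python driver around Hadoop Streaming (the jar path, HDFS directories, and file names are placeholders, and a real driver would test the centres for convergence rather than run a fixed number of iterations):

```python
#!/usr/bin/env python
# Driver sketch: re-run the same streaming job in a loop, feeding each
# iteration's output (the new centres) back in as the next iteration's
# side input.
import subprocess

STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'
POINTS = 'points'              # HDFS directory holding the 2d points
CENTRES_LOCAL = 'centres.txt'  # local copy of the current centres

for i in range(10):            # a real driver would stop on convergence
    output = 'centres-%d' % (i + 1)
    subprocess.check_call([
        'hadoop', 'jar', STREAMING_JAR,
        '-input', POINTS,
        '-output', output,
        '-mapper', 'mapper.py',
        '-reducer', 'reducer.py',
        '-file', 'mapper.py',
        '-file', 'reducer.py',
        '-file', CENTRES_LOCAL,  # the mapper reads the current centres here
    ])
    # Pull the new centres out of HDFS so the next iteration can ship them.
    with open(CENTRES_LOCAL, 'w') as f:
        subprocess.check_call(
            ['hadoop', 'fs', '-cat', output + '/part-*'], stdout=f)
```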

0 votes

Hadoop's Java interface has the concept of chaining several jobs: http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

However, since you're using Hadoop Streaming, you don't have any built-in support for chaining jobs and managing workflows.

You should check out Oozie, which should do the job for you: http://yahoo.github.com/oozie/

0 votes

Here are a few ways to do it: github.com/bwhite/hadoop_vision/tree/master/kmeans

Also check this out (it has Oozie support): http://bwhite.github.com/hadoopy/

0 votes

Feels funny to be answering my own question. I used Pig 0.9 (not yet released, but available in the trunk), which adds support for modularity and flow control by allowing Pig statements to be embedded in scripting languages like Python.

So I wrote a main Python script containing a loop, and inside the loop I called my Pig scripts; the Pig scripts in turn called my UDFs. So I had to write three different programs, but it worked out fine; a rough sketch of the structure is below.
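A minimal sketch of that structure, run with `pig driver.py` (the file names are placeholders, and the actual k-means Pig Latin is elided; the real script registers the Python UDFs and would also be bound to the points location):

```python
#!/usr/bin/python
# driver.py -- embedded Pig (0.9): a plain Python loop around a Pig script.
from org.apache.pig.scripting import Pig

P = Pig.compile("""
-- Stand-in for the real k-means step. The real script also does
--   register 'udfs.py' using jython as udfs;
-- and computes new centres from the points; here the centres are
-- just passed through so the loop mechanics are visible.
old = LOAD '$centres' USING PigStorage(',') AS (x:double, y:double);
STORE old INTO '$output';
""")

centres = 'centres-0'
for i in range(10):                       # or loop until convergence
    output = 'centres-%d' % (i + 1)
    result = P.bind({'centres': centres, 'output': output}).runSingle()
    if not result.isSuccessful():
        raise RuntimeError('iteration %d failed' % i)
    centres = output                      # new centres feed the next pass
```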

You can check the example here - http://www.mail-archive.com/[email protected]/msg00672.html

For the record, my UDFs were also written in Python, using this new feature that allows writing UDFs in scripting languages.