I've written a simple k-means clustering code for Hadoop (two separate programs - mapper and reducer). The code is working over a small dataset of 2d points on my local box. It's written in Python and I plan to use Streaming API.
I would like suggestions on how best to run this program on Hadoop.
After each run of mapper and reducer, new centres are generated. These centres are input for the next iteration.
From what I can see, each mapreduce iteration will have to be a separate mapreduce job. And it looks like I'll have to write another script (python/bash) to extract the new centres from HDFS after each reduce phase, and feed it back to mapper.
Any other easier, less messier way? If the cluster happens to use a fair scheduler, It will be very long before this computation completes?