3 votes

I've written a simple k-means clustering program for Hadoop (two separate programs: a mapper and a reducer). The code works on a small dataset of 2D points on my local box. It's written in Python, and I plan to use the Streaming API.
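For concreteness, here is a minimal sketch of the standard k-means map/reduce split, close to what I have (the centres file name and the "x,y" input format are just what I happen to use):

```python
#!/usr/bin/env python
# mapper.py -- emit "<index of nearest centre>\t<x,y>" for each input point.
import sys

# The current centres are shipped to every task with the streaming -file
# option; "centres.txt" is just an example name, one "x,y" pair per line.
centres = [tuple(map(float, l.split(',')))
           for l in open('centres.txt') if l.strip()]

for line in sys.stdin:
    if not line.strip():
        continue
    x, y = map(float, line.split(','))
    nearest = min(range(len(centres)),
                  key=lambda i: (x - centres[i][0]) ** 2
                              + (y - centres[i][1]) ** 2)
    print('%d\t%s,%s' % (nearest, x, y))
```

```python
#!/usr/bin/env python
# reducer.py -- average the points assigned to each centre and emit one
# new centre per line, as "x,y". A dict is used, so this does not depend
# on the keys arriving grouped.
import sys

sums = {}  # centre index -> [sum_x, sum_y, count]
for line in sys.stdin:
    key, value = line.strip().split('\t')
    x, y = map(float, value.split(','))
    s = sums.setdefault(key, [0.0, 0.0, 0])
    s[0] += x
    s[1] += y
    s[2] += 1

for key in sums:
    sx, sy, n = sums[key]
    print('%s,%s' % (sx / n, sy / n))
```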

I would like suggestions on how best to run this program on Hadoop.

After each run of the mapper and reducer, new centres are generated; these centres are the input for the next iteration.

From what I can see, each iteration will have to be a separate MapReduce job. It also looks like I'll have to write another script (Python/Bash) to extract the new centres from HDFS after each reduce phase and feed them back to the mapper.

Is there an easier, less messy way to do this? And if the cluster happens to use a fair scheduler, will it take very long for this computation to complete?


4 Answers

1 vote

You needn't write another job. You can put the same job in a loop (a while loop) and just keep changing its parameters: when the mapper and reducer complete, control returns to the loop, you create a new configuration, and the input is automatically the output of the previous phase.
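A minimal sketch of that loop as a Python driver around Hadoop Streaming (the jar path, HDFS directories, and file names are placeholders, and a real driver would test the centres for convergence rather than run a fixed number of iterations):

```python
#!/usr/bin/env python
# Driver sketch: re-run the same streaming job in a loop, feeding each
# iteration's output (the new centres) back in as the next iteration's
# side input.
import subprocess

STREAMING_JAR = '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar'
POINTS = 'points'              # HDFS directory holding the 2d points
CENTRES_LOCAL = 'centres.txt'  # local copy of the current centres

for i in range(10):            # a real driver would stop on convergence
    output = 'centres-%d' % (i + 1)
    subprocess.check_call([
        'hadoop', 'jar', STREAMING_JAR,
        '-input', POINTS,
        '-output', output,
        '-mapper', 'mapper.py',
        '-reducer', 'reducer.py',
        '-file', 'mapper.py',
        '-file', 'reducer.py',
        '-file', CENTRES_LOCAL,  # the mapper reads the current centres here
    ])
    # Pull the new centres out of HDFS so the next iteration can ship them.
    with open(CENTRES_LOCAL, 'w') as f:
        subprocess.check_call(
            ['hadoop', 'fs', '-cat', output + '/part-*'], stdout=f)
```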

0 votes

Hadoop's Java interface has the concept of chaining several jobs: http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

However, since you're using Hadoop Streaming, you don't have any built-in support for chaining jobs and managing workflows.

You should check out Oozie, which should do the job for you: http://yahoo.github.com/oozie/

0 votes

Here are a few ways to do it: github.com/bwhite/hadoop_vision/tree/master/kmeans

Also check this out (it has Oozie support): http://bwhite.github.com/hadoopy/

0 votes

Feels funny to be answering my own question. I used Pig 0.9 (not yet released, but available in the trunk), which adds support for modularity and flow control by allowing Pig statements to be embedded in scripting languages like Python.

So I wrote a main Python script containing a loop, and inside the loop I called my Pig scripts; the Pig scripts in turn called my UDFs. So I had to write three different programs, but it worked out fine; a rough sketch of the structure is below.
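A minimal sketch of that structure, run with `pig driver.py` (the file names are placeholders, and the actual k-means Pig Latin is elided; the real script registers the Python UDFs and would also be bound to the points location):

```python
#!/usr/bin/python
# driver.py -- embedded Pig (0.9): a plain Python loop around a Pig script.
from org.apache.pig.scripting import Pig

P = Pig.compile("""
-- Stand-in for the real k-means step. The real script also does
--   register 'udfs.py' using jython as udfs;
-- and computes new centres from the points; here the centres are
-- just passed through so the loop mechanics are visible.
old = LOAD '$centres' USING PigStorage(',') AS (x:double, y:double);
STORE old INTO '$output';
""")

centres = 'centres-0'
for i in range(10):                       # or loop until convergence
    output = 'centres-%d' % (i + 1)
    result = P.bind({'centres': centres, 'output': output}).runSingle()
    if not result.isSuccessful():
        raise RuntimeError('iteration %d failed' % i)
    centres = output                      # new centres feed the next pass
```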

You can check the example here - http://www.mail-archive.com/[email protected]/msg00672.html

For the record, my UDFs were also written in Python, using this new feature that allows writing UDFs in scripting languages.