1 vote

I have written K-Means Clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole dataset is divided into only two sets, and each Reducer receives too many values for a particular key, i.e. the cluster centroid. How do I solve this?

Note: I use the iterative approach to calculate the new centers.

Reducer would receive too many values for a particular key, so what? - luoluo
If the data size is very huge, the reducer will have to loop through a lot of values, right? So won't it be too much work for a single machine? - Punit Naik
Secondary sort in MapReduce: gist.github.com/airawat/6604175 - luoluo
Thank you for the answer, but I cannot seem to understand how I would apply this to my problem. - Punit Naik
Will it be the same as using a combiner function? - Punit Naik

2 Answers

2 votes

Algorithmically, there is not much you can do, as the nature of this algorithm is the one that you describe. Your only option in this respect is to use more clusters and divide your data among more reducers, but that yields a different result.

So the only thing you can do, in my opinion, is compress your data. And I do not mean only using one of Hadoop's compression codecs.

For instance, you could find a compact representation of your data: e.g., give an integer id to each element and pass only this id to the reducers. This will save network traffic (store the elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) and memory on each reducer. A minimal sketch of such a writable follows.
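For illustration, here is a minimal sketch of that writable, using the standard Hadoop ArrayWritable and VIntWritable classes; the class name VIntArrayWritable is just the one suggested above:

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.VIntWritable;

    // Serializes an array of variable-length integer ids, so each
    // small id costs 1-5 bytes on the wire instead of a fixed 4.
    public class VIntArrayWritable extends ArrayWritable {

        // No-arg constructor is required by Hadoop for deserialization.
        public VIntArrayWritable() {
            super(VIntWritable.class);
        }

        public VIntArrayWritable(VIntWritable[] values) {
            super(VIntWritable.class, values);
        }
    }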

In the case of k-means, I think that a combiner is not applicable, but if it is, it would greatly reduce the network and reducer overhead.

EDIT: It seems that you CAN use a combiner, if you follow this iterative implementation; see the sketch below. Please edit your question to describe the algorithm that you have implemented.
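To illustrate the combiner idea, here is a sketch assuming the mapper emits each point as a text record of the form 1|x1,x2,... keyed by the id of its nearest centroid; that record format and the class name PartialSumCombiner are assumptions for illustration, not part of the question:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Collapses all points assigned to one centroid on this mapper into
    // a single "count|sum1,sum2,..." record, so the reducer only merges
    // a handful of pre-aggregated records per centroid instead of
    // millions of raw points.
    public class PartialSumCombiner
            extends Reducer<IntWritable, Text, IntWritable, Text> {

        @Override
        protected void reduce(IntWritable centroidId, Iterable<Text> values,
                              Context context)
                throws IOException, InterruptedException {
            long count = 0;
            double[] sum = null;

            for (Text v : values) {
                String[] parts = v.toString().split("\\|");
                String[] coords = parts[1].split(",");
                if (sum == null) {
                    sum = new double[coords.length];
                }
                for (int i = 0; i < coords.length; i++) {
                    sum[i] += Double.parseDouble(coords[i]);
                }
                count += Long.parseLong(parts[0]);
            }

            // Emit in the same format the mapper uses, so the reducer
            // can parse combined and uncombined records alike.
            StringBuilder out = new StringBuilder().append(count).append('|');
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) out.append(',');
                out.append(sum[i]);
            }
            context.write(centroidId, new Text(out.toString()));
        }
    }

Because sums and counts are associative, the reducer can run the same aggregation and then divide the summed vector by the total count to obtain the new centroid.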

-1 votes

If there is too much data in the shuffle, you will run into OOM issues.

Try to split the dataset into smaller chunks and tune

    yarn.app.mapreduce.client.job.retry-interval
    mapreduce.reduce.shuffle.retry-delay.max.ms

so that there are more splits, but the job's retries are spaced far enough apart that you do not run into OOM issues.
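For what it's worth, a sketch of where those knobs would be set in a standard new-API driver; mapreduce.input.fileinputformat.split.maxsize is one way to force more, smaller splits, and all values below are placeholders rather than tested recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class KMeansDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Smaller max split size -> more, smaller map tasks,
            // each contributing less data to the shuffle at once.
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                         64L * 1024 * 1024); // 64 MB, placeholder value
            // Space out retries, as the answer suggests.
            conf.setLong("yarn.app.mapreduce.client.job.retry-interval", 30000L);
            conf.setLong("mapreduce.reduce.shuffle.retry-delay.max.ms", 60000L);

            Job job = Job.getInstance(conf, "k-means iteration");
            // ... set mapper/combiner/reducer and input/output paths as usual ...
        }
    }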