
I am using k-means clustering with Mahout and I am stuck on the following problem; I am not sure whether it can be solved at all.

In my implementation, two different threads create the following clusters:

CL-1{n=4 c=[1.75] r=[0.82916]}

CL-1{n=2 c=[4.5] r=[0.5]}

I would like to combine these two clusters into one final cluster. In my code, I can work out that the final cluster has n=6 points in total and that the new center, the weighted average of the two centers, is c=2.666, but I am not able to find the combined radius.

I know that the radius is the population standard deviation, and I could calculate it if I knew each point that belongs to the cluster.

However, in my case I do not know the individual points, so I need some kind of "average" of the two radii mentioned above, in order to end up with this: CL-1{n=6 c=[2.666] r=[???]}.

Any ideas? Thanks for your help.

I'd recommend also posting this on Math Stack Exchange (math.stackexchange.com). They might have more insight into how to solve this. - Chris
Done! Thanks for your advice! I hope someone will help me with this :) - Tha Q
Is the problem that you don't know which points belong to the new cluster? I do not expect that the new standard deviation will be the average of the old ones. The new mean (center) is not likely to be the mean of the old centers either (unless both clusters contained the same number of points). - TravisJ
If you start by taking a weighted average for the new mean, then you can at least get an upper bound on the radius of the new cluster. If the first cluster's center is d_1 units from the new center and the second cluster's center is d_2 units away (the clusters had r_1 and r_2 as radii), then new_R <= max{ d_1+r_1, d_2+r_2 } - TravisJ
As for the new center, I am pretty sure that it is the average of the numbers 1.75, 1.75, 1.75, 1.75, 4.5, 4.5 (the first four numbers are the center of the "first" cluster, and the other two are the center of the "second" cluster). I can confirm this because I already know the correct final results. The radius is a different story. It is certainly not the average, but I know that the final radius should be 1.49071. If I knew the points, this would solve my problem, but I only have this information. It seems complicated :/ - Tha Q

1 Answer


It's not hard. Remember how the "radius" (not a very good name) is computed.

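It's probably the standard deviation; so if you square this value and multiply it by the number of objects, you get the sum of squared deviations from that cluster's center. Add the number of objects times the squared center and you have the plain sum of squares, which you can aggregate across clusters by simple addition. Then reverse the process to get a standard deviation again: divide by the total number of objects, subtract the square of the new center, and take the square root. It's pretty basic statistics; you are combining second moments weighted by cluster size, just like you computed the weighted arithmetic mean for the center. For your two clusters the sums of squares are about 15 and 41, so the result is sqrt(56/6 - 2.6667^2), which is about 1.4907, the value you expected.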
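The aggregation described above can be sketched in a few lines of Python. This is only an illustration, assuming the radius really is the population standard deviation and the data is one-dimensional; `merge_clusters` is a made-up helper name, not part of Mahout or any other library.

```python
import math


def merge_clusters(n1, c1, r1, n2, c2, r2):
    """Merge two 1-D cluster summaries given as (count, center, radius).

    Assumes the radius is the population standard deviation of the cluster.
    """
    n = n1 + n2
    # Weighted arithmetic mean of the two centers.
    c = (n1 * c1 + n2 * c2) / n
    # Per cluster: n * (r^2 + c^2) is the sum of the squared point values.
    sum_sq = n1 * (r1 ** 2 + c1 ** 2) + n2 * (r2 ** 2 + c2 ** 2)
    # Population variance of the merged cluster: E[x^2] - (E[x])^2.
    variance = sum_sq / n - c ** 2
    # Guard against a tiny negative value caused by rounding.
    r = math.sqrt(max(variance, 0.0))
    return n, c, r


n, c, r = merge_clusters(4, 1.75, 0.82916, 2, 4.5, 0.5)
print(n, round(c, 3), round(r, 4))
```

With the two clusters from the question this should come out close to n=6, c=2.667 and r=1.4907. For multi-dimensional clusters the same computation could be applied to each coordinate separately, assuming the radius is stored per dimension. If the centers are very large compared to the radii, subtracting `c ** 2` can lose precision; in that case it may be safer to add up `n_i * (r_i^2 + (c_i - c)^2)` over the clusters and divide by `n`.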

However, since your data is one-dimensional, I'm pretty sure it will fit into main memory. As long as your data fits into memory, stay away from Mahout. It's slooooow. Use something like ELKI instead, or SciPy, or R. Run benchmarks. In my experience Mahout will perform several orders of magnitude slower than all the others. You won't need all of this Canopy stuff then either.