
I run K-Means using:

KMeansDriver.run(new Path("./bd.seq.file"), new Path(clustersLoc), new Path("output"),
            new EuclideanDistanceMeasure(), 0.001, 10, true, 0.5, false);

My aim is to find out which cluster each of my original vectors belongs to. From what I understand, this mapping is supposed to be in output/clusteredPoints/part-m-00000, but that file looks like an empty (120-byte) sequence file.

What gives?

Another clue I just found: this happens only on Mahout 0.7, so it is either a bug or an undocumented change in behavior. In Mahout 0.5 I got a file at output/clusteredPoints/part-m-00000 containing the mapping of vector to cluster ...daniel_or_else

1 Answer


OK, I finally got it (at least partially). It has to do with the 8th parameter of KMeansDriver.run(). With a value of 0, the output is the same as in Mahout 0.5. The parameter is named clusterClassificationThreshold, and its javadoc states:

Is a clustering strictness / outlier removal parameter. Its value should be between 0 and 1. Vectors having pdf below this value will not be clustered.

For any Mahout beginners like me, pdf stands for "probability density function". I'm not sure I fully understand this parameter (googling did not help; the javadocs are all you're going to get), but my guess is that, because it is part of a mechanism that filters the original vectors, the Mahout developers chose not to write out the clustered points when it is not 0.
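To make the javadoc sentence concrete, here is a minimal sketch of what a threshold like this does to a set of vectors. It is not Mahout code: the class ThresholdFilterSketch, the method keptIndices, and the pdf values are all made up for illustration, and treating a pdf exactly equal to the threshold as "kept" is my assumption based on the wording "below this value".

```java
import java.util.ArrayList;
import java.util.List;

public class ThresholdFilterSketch {

    // Hypothetical helper (not part of Mahout): returns the indices of the
    // vectors whose pdf is at or above the threshold. Vectors with a pdf
    // below the threshold are dropped, as the javadoc describes.
    public static List<Integer> keptIndices(double[] pdfs, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < pdfs.length; i++) {
            if (pdfs[i] >= threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up pdf values, one per input vector.
        double[] pdfs = {0.9, 0.3, 0.05, 0.6};

        // With a threshold of 0.5, only some of the vectors survive.
        System.out.println(keptIndices(pdfs, 0.5));

        // With a threshold of 0, nothing is filtered out.
        System.out.println(keptIndices(pdfs, 0.0));
    }
}
```

If the real filter works anything like this, a threshold of 0 can never reject a vector, which would be consistent with 0 giving the Mahout 0.5 behavior.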