Cannot get clustering output Mahout

Question

I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.

If I understood well, clusters-x are centroid locations in each of iterations, clusters-x-final are final centroid locations, and clusteredPoints should be the points being clustered with cluster id and weight which represents probability of belonging to cluster (depending on the distance between point and its centroid). On the other hand, clusters-x and clusters-x-final contain clusters centroids, number of elements, features values of centroid and the radius of the cluster (distance between centroid and its farthest point.

How do I examine this outputs?

I used cluster dumper successfully for clusters-x and clusters-x-final from terminal, but when I used it clusteredPoints, I got an empty file? What seems to be the problem?

And how can I get this values from code? I mean, the centroid values and points belonging to clusters?

FOr clusteredPoint I used IntWritable as key, and WeightedPropertyVectorWritable for value, in a while loop, but it passes the loop like there are no elements in clusteredPoints?

This is even more strange because the file that I get with clusterDumper is empty?

What could be the problem?

Any help would be greatly appreciated!

Navnish Bhardwaj Navnish Bhardwaj · Accepted Answer · 2014-09-04T07:10:14

I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).

As far as linking points back to the input that created them I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints) you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().

Update in response to comment

When you initially read your data into Mahout, you convert it into a collection of vectors with which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types which you can use, but they also give you access to a Vector wrapper class called NamedVector which will allow you to identify each vector.

For example, you could create each NamedVector as follows:

NamedVector nVec = new NamedVector(
    new SequentialAccessSparseVector(vectorDimensions), 
    vectorName
    );

Then you write your collection of NamedVectors to file with something like:

SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();

// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), nVec);

You can now use this file as input to one of the clustering algorithms.

After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.

You can then read in this points file and extract the name you associated to each vector. It'll look something like this:

IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();

while (reader.next(clusterId, vector))
{
    NamedVector nVec = (NamedVector)vector.getVector();
    // you now have access to the original name using nVec.getName()
}

Cannot get clustering output Mahout

2 Answers