I have what feels like a simple problem, but I can't seem to find an answer. I'm pretty new to Weka, but I feel like I've done a bit of research on this (at least read through the first couple of pages of Google results) and come up dry.

I am using Weka to run clustering with Simple K-Means. In the result list I have no problem visualizing my output ("Visualize cluster assignments"), and it is clear both from my understanding of the K-Means algorithm and from Weka's output that each of my instances ends up as a member of one of the clusters (centered around a particular centroid, if you will).

I can see something of the cluster composition from the text output. However, Weka gives me no explicit "mapping" from instance number to cluster number. I would like something like:

instance 1 --> cluster 0
instance 2 --> cluster 0
instance 3 --> cluster 2
instance 4 --> cluster 1
... etc.

How do I obtain these results without calculating the distance from each item to each centroid on my own?

2 Answers

14
votes

I had the same problem and figured it out. I am posting the method here in case anyone else needs it:

It's actually quite simple: you have to use Weka's Java API.

import weka.clusterers.SimpleKMeans;

// `instances` (weka.core.Instances) and `numberOfClusters` (int) are assumed to be defined already
SimpleKMeans kmeans = new SimpleKMeans();
kmeans.setSeed(10);

// This is the important parameter to set
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(numberOfClusters);
kmeans.buildClusterer(instances);

// getAssignments() returns the cluster number (starting with 0) for each instance;
// the array has as many elements as there are instances
int[] assignments = kmeans.getAssignments();

int i = 0;
for (int clusterNum : assignments) {
    // %n ends the line so each mapping is printed on its own row
    System.out.printf("Instance %d -> Cluster %d%n", i, clusterNum);
    i++;
}
9
votes

Aha, I think I found what I was looking for. Under the cluster visualizer, click "Save". This saves the whole data set as an ARFF file almost identical to the input file I provided, but with 2 new attributes: the first attribute is the index of the instance, while the last attribute is the cluster assignment. Now I just have to parse the crap out of it!
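
For anyone who wants to skip the manual parsing, here is a rough sketch of how the saved file could be turned into the instance-to-cluster mapping using only the Java standard library. It assumes a plain, dense ARFF file where the first field of each data row is the instance index and the last field is the cluster label, with no quoted commas and no sparse rows. The class name `ArffClusterMapping`, the method `parseAssignments`, and the attribute names and values in the sample are made up for illustration; check them against your own saved file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ArffClusterMapping {

    // Maps the first field of each data row (instance index)
    // to the last field of that row (cluster label).
    static Map<Integer, String> parseAssignments(List<String> lines) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        boolean inData = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and ARFF comments
            }
            if (!inData) {
                // Everything before @data is header (relation and attribute declarations)
                if (line.equalsIgnoreCase("@data")) {
                    inData = true;
                }
                continue;
            }
            // Assumption: simple comma-separated values, no quoted commas
            String[] fields = line.split(",");
            int index = (int) Double.parseDouble(fields[0].trim());
            String cluster = fields[fields.length - 1].trim();
            mapping.put(index, cluster);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Hypothetical sample in the shape described above
        List<String> sample = List.of(
            "@relation sample_clustered",
            "@attribute Instance_number numeric",
            "@attribute x numeric",
            "@attribute y numeric",
            "@attribute Cluster {cluster0,cluster1,cluster2}",
            "@data",
            "0,1.0,2.0,cluster0",
            "1,1.1,2.1,cluster0",
            "2,8.0,9.0,cluster2",
            "3,5.0,5.5,cluster1");
        parseAssignments(sample).forEach(
            (i, c) -> System.out.printf("instance %d --> %s%n", i, c));
    }
}
```

To run it against a real file, the lines could be loaded with `java.nio.file.Files.readAllLines(...)` and passed to `parseAssignments`. If your data contains quoted strings with commas, a proper ARFF reader would be the safer choice.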