I am trying to run the SimpleKMeans algorithm in Weka and understand its results.

This is my training data:

@relation weather_clustered

@attribute Instance_number numeric
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@attribute cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5}

@data
0,sunny,85,85,FALSE,no,cluster3
1,sunny,80,90,TRUE,no,cluster5
2,overcast,83,86,FALSE,yes,cluster2
4,rainy,68,80,FALSE,yes,cluster4

Then I run SimpleKMeans with numClusters=2 and seed=10. I want to see how the clustering result relates to the attribute cluster, in other words which computed cluster each value clusterX ends up in. Note that I don't assume the attribute cluster is the right clustering.

To see the correspondence in the output, I set "Classes to clusters evaluation" to (Nom) cluster

and get the following results:

Class attribute: cluster
Classes to Clusters:

0 1  <-- assigned to cluster
 0 0 | cluster0
 0 0 | cluster1
 1 0 | cluster2
 0 1 | cluster3
 1 0 | cluster4
 0 1 | cluster5

Cluster 0 <-- cluster2
Cluster 1 <-- cluster3

Incorrectly clustered instances :   2.0  50      %

I like the correspondence list, which is exactly what I need. However, I don't understand what the following means:

Cluster 0 <-- cluster2
Cluster 1 <-- cluster3

In addition, I am confused by the following result:

Incorrectly clustered instances :   2.0  50      %

Where does it come from? How does Weka know the correct result? I don't have a correct result, so maybe it treats the attribute cluster as the correct clustering. In short, I don't understand the output.

1 Answer

SimpleKMeans is a clustering algorithm that groups your data into K clusters.

In your case, numClusters=2 means K=2, so your data is grouped into 2 clusters, which Weka numbers from 0:

Cluster 0

Cluster 1

When you select classes to clusters evaluation, Weka does the following:

  1. Ignores the attribute that you selected for evaluation. In your case, that is the cluster attribute.

  2. Applies the KMeans algorithm without using any information from your cluster attribute.

  3. Evaluates the resulting clusters against your initial dataset (including your cluster attribute).
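
As a rough illustration of step 3, here is a minimal Python sketch (not Weka code) that rebuilds the Classes to Clusters matrix. The cluster assignments are read off the matrix in the question, and the names `instances`, `labels` and `matrix` are made up for this example.

```python
from collections import Counter

# (value of the cluster attribute, cluster assigned by k-means) per instance.
# The assignments are taken from the matrix in the question's output.
instances = [
    ("cluster3", 1),
    ("cluster5", 1),
    ("cluster2", 0),
    ("cluster4", 0),
]

labels = ["cluster0", "cluster1", "cluster2", "cluster3", "cluster4", "cluster5"]
num_clusters = 2

# Count how many instances of each label landed in each cluster.
counts = Counter(instances)
matrix = {label: [counts[(label, k)] for k in range(num_clusters)] for label in labels}

# One row per label, one column per computed cluster.
for label in labels:
    print(matrix[label], "|", label)
```

Each row is one value of your cluster attribute and each column is one of the two clusters found by KMeans, which is the same layout as the matrix Weka prints.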

So, in your case,

cluster0, cluster1, ..., cluster5

act as labels for your instances and are used to evaluate the clustering.

To better understand the output, you have

@data
0,sunny,85,85,FALSE,no,cluster3
1,sunny,80,90,TRUE,no,cluster5
2,overcast,83,86,FALSE,yes,cluster2
4,rainy,68,80,FALSE,yes,cluster4

and

Cluster 0 <-- cluster2
Cluster 1 <-- cluster3


Incorrectly clustered instances :   2.0  50      %

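The lines Cluster 0 <-- cluster2 and Cluster 1 <-- cluster3 mean that Weka has labelled each computed cluster with one value of your cluster attribute. According to the matrix, Cluster 0 contains the cluster2 and cluster4 instances, and Cluster 1 contains the cluster3 and cluster5 instances. Every instance whose label differs from the label assigned to its cluster counts as incorrectly clustered, so there are 2 incorrectly clustered instances:

1,sunny,80,90,TRUE,no,cluster5
4,rainy,68,80,FALSE,yes,cluster4

The 50% follows from the totals: you have 4 instances, of which 2 are incorrectly clustered (2 is 50% of 4). So Weka does treat the cluster attribute as the correct answer, but only in this evaluation step and not while clustering.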
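
The counting can be sketched in a few lines of Python. This is a simplified illustration, not Weka's implementation: it labels each cluster with its most frequent class value and breaks ties in favour of the first-listed value, which is an assumption that happens to reproduce the mapping shown above. The function name `count_errors` is hypothetical.

```python
# Classes to Clusters matrix from the question:
# rows are values of the cluster attribute, columns are Cluster 0 and Cluster 1.
matrix = {
    "cluster0": [0, 0],
    "cluster1": [0, 0],
    "cluster2": [1, 0],
    "cluster3": [0, 1],
    "cluster4": [1, 0],
    "cluster5": [0, 1],
}

def count_errors(matrix, num_clusters):
    """Label each cluster with its most frequent class and count the rest as errors."""
    mapping = {}
    errors = 0
    for k in range(num_clusters):
        column = {label: row[k] for label, row in matrix.items()}
        # max() returns the first label with the highest count (assumed tie-break).
        best = max(column, key=column.get)
        mapping[k] = best
        errors += sum(column.values()) - column[best]
    return mapping, errors

mapping, errors = count_errors(matrix, 2)
total = sum(sum(row) for row in matrix.values())
percent = 100.0 * errors / total

print(mapping)          # {0: 'cluster2', 1: 'cluster3'}
print(errors, percent)  # 2 50.0
```

Whichever way the ties are broken here, each cluster holds two instances with two different labels, so one instance per cluster is always counted as incorrect.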