I found this explanation at this link:

  • Use training set (default). After generating the clustering Weka
    classifies the training instances into clusters according to the
    cluster representation and computes the percentage of instances
    falling in each cluster.

  • In Supplied test set or Percentage split Weka can evaluate
    clusterings on separate test data if the cluster representation
    is probabilistic (e.g. for EM).

  • Classes to clusters evaluation. In this mode Weka first ignores
    the class attribute and generates the clustering. Then during
    the test phase it assigns classes to the clusters, based on the
    majority value of the class attribute within each cluster. Then
    it computes the classification error, based on this assignment
    and also shows the corresponding confusion matrix. An example
    of this for k-means is shown below.

However, it is not clear to me how Weka assigns a new instance to a cluster and how it decides whether that instance belongs in that cluster or not. Or does it only calculate the distribution of test instances across the clusters, without judging whether each assignment is correct?

1 Answer

The notion of cluster assignment depends on the particular clustering algorithm. Let's consider the simplest case: hard clustering with k-means. In that case each point is assigned to the cluster whose centroid is nearest to it.
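To make the nearest-centroid rule concrete, here is a minimal sketch in plain Python. It is not Weka's code: the function name, the centroid values and the use of unnormalized Euclidean distance are all assumptions made for illustration.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids, standing in for the result of a finished k-means run.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# A new instance is simply given the index of its nearest centroid.
new_instance = (8.5, 9.0)
cluster = nearest_centroid(new_instance, centroids)
print(f"instance {new_instance} -> cluster {cluster}")
```

Note that nothing here checks whether the assignment is "correct": without class labels there is nothing to compare it against.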

With respect to new test instances, the options mean the following:

  • Use training set (default)

Since clustering is usually performed in an unsupervised manner, the default option means that once the clustering algorithm has finished, Weka reports how many instances have been assigned to each cluster (also as percentages). No new test instances are involved in this case: the same training instances are used to compute the final outcome of the clustering.
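The kind of summary this mode produces can be sketched as follows. This is an illustration only, assuming the cluster assignments are already available as a list of cluster indices; `cluster_distribution` is a hypothetical helper, not a Weka function.

```python
from collections import Counter


def cluster_distribution(assignments):
    """Map each cluster index to (instance count, percentage of all instances)."""
    counts = Counter(assignments)
    total = len(assignments)
    return {c: (n, 100.0 * n / total) for c, n in sorted(counts.items())}


# Hypothetical cluster index for each training instance.
assignments = [0, 0, 1, 0]

for cluster, (count, percent) in cluster_distribution(assignments).items():
    print(f"cluster {cluster}: {count} instances ({percent:.0f}%)")
```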

  • Classes to clusters evaluation

This option lets you actually evaluate the clustering against class labels, so a labelled dataset must be available. In this mode Weka first runs the unsupervised clustering algorithm while ignoring the class attribute. It then assigns a class to each cluster based on the majority class among that cluster's instances, and counts the instances whose own label differs from their cluster's class as errors. It also generates a confusion matrix, as in a standard supervised classification task. However, no separate test set is used: the evaluation relies on the class labels within the same training set, so there are no new test instances here either.
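The majority-vote evaluation described above can be sketched in a few lines of Python. This follows the description given here rather than Weka's actual implementation; the function name and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def classes_to_clusters(assignments, labels):
    """Label each cluster with its majority class and compute the error rate."""
    # Count how often each class label occurs inside each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(assignments, labels):
        per_cluster[cluster][label] += 1

    # Each cluster is mapped to its most frequent class.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}

    # An instance is an error if its label differs from its cluster's class.
    errors = sum(1 for c, l in zip(assignments, labels) if mapping[c] != l)
    return mapping, errors / len(labels), per_cluster


# Hypothetical clustering result and the class labels that were held back.
assignments = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]

mapping, error_rate, table = classes_to_clusters(assignments, labels)
print(mapping)
print(f"error rate: {error_rate:.2%}")
```

The `table` value holds the per-cluster class counts, which is the information a confusion matrix would display.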

  • Supplied test set or Percentage split

With this option you can provide a separate test set (or have Weka split the available dataset into training and test parts). In this case Weka first runs the clustering algorithm on the training set and then assigns each test instance to a cluster. The output has the same form as for the default option (Use training set), but it is computed only from the instances in the test set.
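A rough sketch of this flow, again in plain Python rather than Weka. The split is a simple ordered cut with no shuffling, and the centroids are hard-coded stand-ins for what a k-means run on the training part would produce; both are simplifying assumptions, as are the function names.

```python
import math
from collections import Counter


def percentage_split(instances, train_percent):
    """Split instances in order into (train, test); no shuffling in this sketch."""
    cut = round(len(instances) * train_percent / 100)
    return instances[:cut], instances[cut:]


def assign(instances, centroids):
    """Assign each instance to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in instances
    ]


data = [(0, 0), (1, 1), (9, 9), (10, 10), (0, 1), (9, 10)]
train, test = percentage_split(data, 66)

# Hypothetical centroids, standing in for a clustering learned on `train` only.
centroids = [(0.5, 0.5), (9.5, 9.5)]

# Only the test instances contribute to the reported distribution.
test_counts = Counter(assign(test, centroids))
for cluster, count in sorted(test_counts.items()):
    print(f"cluster {cluster}: {count} of {len(test)} test instances")
```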

Hope this helps!