13 votes

Many clustering algorithms are available. A popular one is k-means, which, given a fixed number of clusters, iterates to find the best assignment of objects to clusters.
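For concreteness, here is a minimal sketch of running k-means in base R (the iris data, k = 3, and nstart = 25 are purely illustrative choices, not part of the question):

    # k-means with a fixed, user-chosen number of clusters
    data(iris)
    x <- iris[, 1:4]                      # numeric features only
    set.seed(42)                          # k-means uses random starting centres
    fit <- kmeans(x, centers = 3, nstart = 25)
    table(fit$cluster)                    # cluster sizes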

What method do you use to determine the number of clusters in the data when doing k-means clustering?

Does any package available in R contain the V-fold cross-validation method for determining the right number of clusters?
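For context, one common heuristic (not V-fold cross-validation) is to plot the total within-cluster sum of squares against k and look for an "elbow"; a base-R sketch, again using iris only as a stand-in dataset:

    # Elbow heuristic: total within-cluster SS for k = 1..10
    x <- iris[, 1:4]
    set.seed(42)
    wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
    plot(1:10, wss, type = "b",
         xlab = "Number of clusters k", ylab = "Total within-cluster SS")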

Another widely used approach is the Expectation-Maximization (EM) algorithm, which assigns to each instance a probability distribution indicating how likely it is to belong to each of the clusters.

Is this algorithm implemented in R?

If it is, does it have the option to automatically select the optimum number of clusters by cross validation?
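For illustration, here is a minimal sketch of EM-fitted Gaussian mixture clustering, assuming the mclust package is installed; note that Mclust chooses the number of components by BIC rather than by cross-validation:

    # Model-based clustering fitted by EM; model and number of components chosen by BIC
    library(mclust)
    x <- iris[, 1:4]
    fit <- Mclust(x, G = 1:9)             # consider 1..9 mixture components
    summary(fit)                          # selected model and number of clusters
    head(fit$z)                           # per-instance membership probabilities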

Do you prefer some other clustering method instead?

I intentionally left out hierarchical clustering because hclust is a rather memory-hungry method, not suitable for the large datasets I'm actually mostly interested in. – George Dontas
Please define what you mean by "optimal". – hadley
Great question @Svante, I've been thinking a lot about that one. I even intended to write a package with several algorithms for the optimal number of clusters (hclust methods only). @hadley, the ones I'm acquainted with are the C-H index (Calinski & Harabasz), the C-index, and the Goodman-Kruskal gamma coefficient, and there's also a way to "pick an optimal cluster solution" by utilizing an F-test. Here's a ref: Milligan, G.W. & Cooper, M.C. (1985). An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50, 159-179. Though I assume that you prefer a "graph-based" decision on the optimal solution... – aL3xa
@hadley, in the sense of maximizing some score function that takes as arguments, perhaps, the between-class distance and the within-class distance. See, for example, the method described in the paragraph "Optimal Number of Clusters" here: sandro.saitta.googlepages.com/… – George Dontas
This may also come in handy: stats.stackexchange.com/questions/723/… – nico
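Picking up the Calinski-Harabasz index mentioned in the comments above, it can be computed directly from the kmeans output; a sketch (iris and the 2..10 range of k are illustrative assumptions):

    # CH(k) = (betweenSS / (k - 1)) / (withinSS / (n - k)); larger is better
    x <- iris[, 1:4]
    n <- nrow(x)
    set.seed(42)
    ch <- sapply(2:10, function(k) {
      fit <- kmeans(x, centers = k, nstart = 25)
      (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
    })
    which.max(ch) + 1                     # k with the highest CH index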

2 Answers

5 votes

For large "sparse" datasets I would seriously recommend the Affinity Propagation method. It has superior performance compared to k-means and it is deterministic in nature.

http://www.psi.toronto.edu/affinitypropagation/ It was published in the journal Science.
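One possible route in R is the apcluster package (an assumption on my part, not mentioned in the paper), which implements affinity propagation on a similarity matrix:

    # Affinity propagation; negDistMat(r = 2) gives negative squared Euclidean distances
    library(apcluster)
    x <- iris[, 1:4]
    ap <- apcluster(negDistMat(r = 2), x)
    length(ap@clusters)                   # number of clusters found automatically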

However, the choice of the optimal clustering algorithm depends on the dataset under consideration. K-means is a textbook method, and it is very likely that someone has developed a better algorithm more suitable for your type of dataset.

This is a good tutorial by Prof. Andrew Moore (CMU, Google) on k-means and hierarchical clustering: http://www.autonlab.org/tutorials/kmeans.html

0 votes

Last week I coded up such an estimate-the-number-of-clusters algorithm for a K-Means clustering program. I used the method outlined in:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.9687&rep=rep1&type=pdf

My biggest implementation problem was finding a suitable cluster validation index (i.e., an error metric) that would work. Now it is mainly a matter of processing speed, but the results currently look reasonable.
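This is not the method from the paper above, but as an example of one widely used validation index, the average silhouette width (cluster package) can be scanned over a range of k in a few lines:

    # Average silhouette width over k = 2..10; pick the k that maximizes it
    library(cluster)
    x <- iris[, 1:4]
    d <- dist(x)
    set.seed(42)
    sil <- sapply(2:10, function(k) {
      cl <- kmeans(x, centers = k, nstart = 25)$cluster
      mean(silhouette(cl, d)[, "sil_width"])
    })
    which.max(sil) + 1                    # best k by average silhouette width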