19
votes

I've been using k-means to cluster my data in R, but I'd like to assess the fit vs. model complexity of my clustering using the Bayesian Information Criterion (BIC) and AIC. Currently the code I've been using in R is:

KClData <- kmeans(Data, centers=2, nstart= 100)

But I'd like to be able to extract the BIC and Log Likelihood. Any help would be greatly appreciated!

4
Function Mclust in package mclust might be of interest. – Roland
Roland, thanks for the tip! I'm actually trying to compare the results of k-means to Mclust's output, which is why I'd like to compare the BIC from my k-means clustering to the BIC from the GMM that Mclust fits. – UnivStudent
I am not an expert, but I think that k-means is not a maximum likelihood algorithm. Are you sure that AIC and BIC are applicable? – Roland
It does have a log-likelihood associated with it, but I'm having trouble finding it and implementing it in R. – UnivStudent
See a similar question on the statistics community: stats.stackexchange.com/q/55147/3277 – ttnphns

4 Answers

16
votes

For anyone else landing here, there's a method proposed by Sherry Towers at http://sherrytowers.com/2013/10/24/k-means-clustering/, which uses output from stats::kmeans. I quote:

The AIC can be calculated with the following function:

kmeansAIC = function(fit){
  m = ncol(fit$centers)
  n = length(fit$cluster)
  k = nrow(fit$centers)
  D = fit$tot.withinss
  return(D + 2*m*k)
}

From the help for stats::AIC, you can also see that the BIC can be calculated in a similar way to the AIC. An easy way to get the BIC is to replace the return() in the above function with this:

return(data.frame(AIC = D + 2*m*k,
                  BIC = D + log(n)*m*k))

So you would use this as follows:

fit <- kmeans(x = data, centers = 6)
kmeansAIC(fit)

7
votes

To compute BIC, just add .5*k*d*log(n) (where k is the number of means, d is the dimension of each data point, and n is the number of data points) to the standard k-means error function.

The standard k-means penalty is \sum_n (m_{k(n)} - x_n)^2, where m_{k(n)} is the mean associated with the nth data point. This penalty can be interpreted as a (negative) log probability, so BIC is perfectly valid.

BIC just adds an additional penalty term to the k-means error proportional to k.
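As a quick sketch of this recipe (assuming the fit comes from stats::kmeans, so the error term is available as tot.withinss; the helper name kmeansBIC is hypothetical):

```r
# Sketch: BIC for a k-means fit, following the recipe above.
# Assumes `fit` is a stats::kmeans result and `x` is the data matrix.
kmeansBIC <- function(fit, x) {
  k <- nrow(fit$centers)   # number of means
  d <- ncol(x)             # dimension of each data point
  n <- nrow(x)             # number of data points
  fit$tot.withinss + 0.5 * k * d * log(n)
}

set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
kmeansBIC(fit, iris[, 1:4])
```

Note the accepted answer's BIC = D + log(n)*m*k is on the -2*log-likelihood scale, so its penalty is twice the one here; either scaling works for comparing models, as long as you use it consistently.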

4
votes

Just to add to what user1149913 said (I don't have enough reputation to comment), since you're using the kmeans function in R, \sum_n (m_{k(n)} - x_n)^2 is already calculated for you as KClData$tot.withinss.
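You can check this equivalence directly (a quick sketch using the iris data as a stand-in for your own):

```r
# Verify that tot.withinss equals the manual sum of squared distances
# from each point to its assigned cluster center.
set.seed(1)
KClData <- kmeans(iris[, 1:4], centers = 2, nstart = 100)

# centers[cluster, ] expands the k x d center matrix to one row per point
manual <- sum((as.matrix(iris[, 1:4]) - KClData$centers[KClData$cluster, ])^2)
all.equal(manual, KClData$tot.withinss)
# [1] TRUE
```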

2
votes

Rather than reimplementing AIC or BIC, we can define a log-likelihood function for kmeans objects; this will then get used by the BIC function in the stats package.

logLik.kmeans <- function(object) structure(
  -object$tot.withinss/2,
  df = nrow(object$centers)*ncol(object$centers),
  nobs = length(object$cluster)
)

Then to use it, call BIC as normal. For example:

example(kmeans, local=FALSE)
BIC(cl)
# [1] 26.22842084

This method will be provided in the next release of the stackoverflow package.