
I need to cluster categorical data. I am using the following k-modes code to build the clusters and to check the optimal number of clusters with the elbow method:


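library(klaR)  # provides kmodes()

# Fit k-modes with 5 clusters
cluster.results <- kmodes(data_cluster, 5, iter.max = 100, weighted = FALSE)

k.max <- 20

# For each candidate k, sum the within-cluster simple-matching distances
wss <- sapply(1:k.max, function(k) {
  sum(kmodes(data_cluster, k, iter.max = 100, weighted = FALSE)$withindiff)
})

plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")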

My questions are:

  1. Is there any other method for choosing the optimal number of clusters with k-modes?
  2. Each seed gives clusters of different sizes, so I am trying several seeds and keeping the one with the lowest total within-cluster sum of squares. Is this approach correct?
  3. How can I check whether my clusters are stable?
  4. I want to apply this clustering to new data (from another year) to predict cluster membership. How can I do that?
  5. Is there any other method for clustering categorical data?

Comment from hpesoj626: Please always include the libraries you used (klaR, for instance?) and a minimal amount of data that we can use to reproduce your problem. For instance, you can paste the output of dput(data_cluster).

2 Answers


My answer only concerns question 5.

You can use mixture models to cluster categorical data (see, for instance, the latent class model). The standard approaches use a mixture of multinomial distributions.

Classical information criteria (like BIC or ICL) can be used to automatically select the number of clusters.
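
To make the model-selection idea concrete, here is a minimal base-R sketch that fits a latent class model (a mixture of independent multinomials) by EM for several numbers of clusters and compares their BIC. The toy data, the helper names `sim_group` and `lcm_bic`, and the settings (200 iterations, 5 random starts) are all assumptions made for illustration; they are not taken from any package. BIC is computed here as -2 log-likelihood + (number of parameters) x log(n), so lower is better. Sign conventions differ between packages, so check which direction a given package treats as better.

```r
set.seed(1)

# Hypothetical toy data: two groups of 100 rows, 4 factors with 3 levels each.
lv <- c("a", "b", "c")
sim_group <- function(n, prob) {
  cols <- lapply(1:4, function(j) {
    factor(sample(lv, n, replace = TRUE, prob = prob), levels = lv)
  })
  names(cols) <- paste0("v", 1:4)
  as.data.frame(cols)
}
toy <- rbind(sim_group(100, c(0.8, 0.1, 0.1)),
             sim_group(100, c(0.1, 0.1, 0.8)))

# EM for a mixture of independent multinomials; returns the best BIC over
# several random starts (assumes factor columns and no missing values).
lcm_bic <- function(data, K, n_iter = 200, n_start = 5) {
  n <- nrow(data)
  # One n x (number of levels) indicator matrix per variable
  ind <- lapply(data, function(f) {
    outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
  })
  # Free parameters: mixing proportions plus level probabilities per cluster
  n_par <- (K - 1) + K * sum(sapply(data, nlevels) - 1)
  best_loglik <- -Inf
  for (s in seq_len(n_start)) {
    z <- matrix(rexp(n * K), n, K)
    z <- z / rowSums(z)                      # random soft assignments
    for (it in seq_len(n_iter)) {
      # M-step: mixing proportions and per-cluster level probabilities
      prop <- colMeans(z)
      theta <- lapply(ind, function(a) {
        cnt <- t(z) %*% a + 1e-6             # light smoothing avoids log(0)
        cnt / rowSums(cnt)
      })
      # E-step: log joint density of each row under each cluster
      logp <- matrix(log(prop), n, K, byrow = TRUE)
      for (j in seq_along(ind)) {
        logp <- logp + ind[[j]] %*% t(log(theta[[j]]))
      }
      mx <- apply(logp, 1, max)
      loglik <- sum(mx + log(rowSums(exp(logp - mx))))
      z <- exp(logp - mx)
      z <- z / rowSums(z)                    # posterior cluster probabilities
    }
    best_loglik <- max(best_loglik, loglik)
  }
  -2 * best_loglik + n_par * log(n)          # lower is better in this convention
}

bic <- sapply(1:4, function(K) lcm_bic(toy, K))
names(bic) <- 1:4
print(round(bic, 1))
print(which.min(bic))                        # number of clusters with the lowest BIC
```

In practice a dedicated package handles initialisation, convergence checks and missing values far more carefully; the sketch only shows where the criterion comes from.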

Mixture models let you compute the classification probabilities of a new observation, and thus quantify the risk of misclassification.
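
As a small worked illustration of that computation, the sketch below applies Bayes' rule to one new observation under a two-cluster multinomial mixture. The parameter values (`prop`, `theta`), the variable names and the helper `posterior` are hypothetical and chosen only to make the arithmetic easy to follow; in a real analysis they would come from the fitted model.

```r
# Hypothetical fitted parameters: 2 clusters, 2 categorical variables.
prop <- c(0.6, 0.4)                          # mixing proportions
theta <- list(                               # rows = clusters, columns = levels
  colour = rbind(c(red = 0.7, blue = 0.3),
                 c(red = 0.2, blue = 0.8)),
  size   = rbind(c(small = 0.5, large = 0.5),
                 c(small = 0.1, large = 0.9))
)

# Posterior probability of each cluster for one observation,
# assuming the variables are independent within a cluster.
posterior <- function(obs, prop, theta) {
  lik <- prop
  for (v in names(theta)) {
    lik <- lik * theta[[v]][, obs[[v]]]      # multiply by P(level | cluster)
  }
  lik / sum(lik)
}

new_obs <- c(colour = "blue", size = "large")
post <- posterior(new_obs, prop, theta)
print(round(post, 3))                        # 0.238 0.762
print(1 - max(post))                         # risk of misclassifying this observation
```

Here the observation is assigned to cluster 2, and the complement of its largest posterior probability (about 0.24) is the estimated risk that this assignment is wrong.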

If you are interested in this method, you can use the R package VarSelLCM. To cluster categorical data, your dataset must be a data.frame and each variable must be stored as a factor.

Here is an example (the number of clusters is allowed to be between 1 and 6):


    library(VarSelLCM)
    out <- VarSelCluster(data_cluster, 1:6, vbleSelec = FALSE)



Hope this helps:

install.packages("NbClust", dependencies = TRUE)
library(NbClust)

# Simulate two groups of 50 rows; 5 variables, each taking the values 0, 1 or 2
Data_Sim <- rbind(matrix(rbinom(250, 2, 0.25), ncol = 5),
                  matrix(rbinom(250, 2, 0.75), ncol = 5))
colnames(Data_Sim) <- letters[1:5]

# Compute all available indices for k-means solutions with 2 to 10 clusters
Clusters <- NbClust(Data_Sim, diss = NULL, distance = "euclidean",
                    min.nc = 2, max.nc = 10, method = "kmeans", index = "all",
                    alphaBeale = 0.1)

# Histogram of the number of clusters proposed by each index
hist(Clusters$Best.nc[1, ],
     breaks = max(na.omit(Clusters$Best.nc[1, ])))