3
votes

I need to cluster categorical data. I am using the following k-modes code to build the clusters and to check the optimal number of clusters with the elbow method:

library(klaR)  # provides kmodes()

set.seed(100000)

# Fit k-modes with 5 clusters
cluster.results <- kmodes(data_cluster, 5, iter.max = 100, weighted = FALSE)

print(cluster.results)

k.max <- 20

# For each k, sum the within-cluster simple-matching distances (withindiff)
wss <- sapply(1:k.max, function(k) {
  set.seed(100000)
  sum(kmodes(data_cluster, k, iter.max = 100, weighted = FALSE)$withindiff)
})

wss

# Elbow plot
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

My questions are:

  1. Is there any other method for checking the optimal number of clusters with k-modes?
  2. Each seed gives different cluster sizes, so I am trying several seeds and keeping the one with the lowest total within-cluster sum of squares. Is this approach correct?
  3. How can I check whether my clusters are stable?
  4. I want to apply these clusters to new data (from another year) and predict cluster membership. How can I do that?
  5. Is there any other method for clustering categorical data?
2
Please always add the libraries that you used (klaR, for instance?) and a minimal amount of data that we can use to reproduce your problem. For instance, you can paste the output of dput(data_cluster). – hpesoj626

2 Answers

0
votes

My answer only addresses question 5.

You can use mixture models to cluster categorical data (see, for instance, the latent class model). The standard approach models the data as a mixture of multinomial distributions.

Classical information criteria (like BIC or ICL) can be used to automatically select the number of clusters.

Mixture models let you compute the classification probabilities of a new observation, and thus quantify the risk of misclassification.
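To make that concrete, here is a minimal base-R sketch of how the classification probabilities of one new observation come out of a fitted mixture. The parameter values, the variable names (colour, size) and the posterior() helper are all made up for illustration; in practice the proportions and level probabilities would come from the fitted model, and the variables are assumed independent within each cluster, as in the latent class model.

```r
# Hypothetical fitted mixture with 2 clusters and 2 categorical variables.
# All numbers below are invented for illustration only.
prop <- c(0.6, 0.4)  # mixing proportions of the two clusters

# P(level | cluster) for each variable: rows = clusters, columns = levels
theta <- list(
  colour = matrix(c(0.7, 0.2, 0.1,
                    0.1, 0.3, 0.6),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("red", "green", "blue"))),
  size   = matrix(c(0.8, 0.2,
                    0.3, 0.7),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("small", "large")))
)

# Posterior cluster probabilities of one observation (Bayes' rule),
# assuming the variables are independent within each cluster.
posterior <- function(obs, prop, theta) {
  lik <- prop
  for (v in names(theta)) {
    lik <- lik * theta[[v]][, obs[[v]]]
  }
  lik / sum(lik)
}

new_obs <- list(colour = "blue", size = "large")
post <- posterior(new_obs, prop, theta)

post             # classification probabilities for each cluster
which.max(post)  # most probable cluster
1 - max(post)    # estimated risk of misclassifying this observation
```

The last line is the quantity referred to above: one minus the largest classification probability is the model's own estimate of how likely the assignment is to be wrong.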

If you are interested in this method, you can use the R package VarSelLCM. To cluster categorical data, your dataset must be a data.frame and each variable must be stored as a factor.

Here is an example (the number of clusters is allowed to be between 1 and 6):

    library(VarSelLCM)

    # Fit models with 1 to 6 clusters, without variable selection
    out <- VarSelCluster(data_cluster, 1:6, vbleSelec = FALSE)

    summary(out)

    # Interactive summary of the results
    VarSelShiny(out)
-1
votes

Hope this helps:

install.packages("NbClust", dependencies = TRUE)
library(NbClust)

# Simulate two groups of 50 observations on 5 variables
Data_Sim <- rbind(matrix(rbinom(250, 2, 0.25), ncol = 5),
                  matrix(rbinom(250, 2, 0.75), ncol = 5))
colnames(Data_Sim) <- letters[1:5]

# Compute all NbClust indices for 2 to 10 clusters
Clusters <- NbClust(Data_Sim, diss = NULL, distance = "euclidean",
                    min.nc = 2, max.nc = 10, method = "kmeans",
                    index = "all", alphaBeale = 0.1)

# Histogram of the number of clusters proposed by each index
hist(Clusters$Best.nc[1, ],
     breaks = max(na.omit(Clusters$Best.nc[1, ])))