Biclustering categorical data by two variables

Question

I have a table of categorical values that I would like to cluster by both rows and columns.

Example data: test_dataset.csv

I,II,III,IV,V
A,0,3,3,2,3
B,0,3,3,0,0
C,0,0,3,3,3
D,0,3,1,3,0
E,0,0,3,0,0

The levels are 0 = "no data", 1 = "no increase", 2 = "mixed", and 3 = "increase".

I found an R package blockcluster that in theory should be able to do this.

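# install.packages("blockcluster")
library(blockcluster)
# Coding: 0 = no data, 1 = no increase, 2 = mixed, 3 = increase
# The header has one fewer field than the data rows, so the first column (A-E) becomes the row names
dataset <- read.table("test_dataset.csv", header = TRUE, sep = ",")
# Request 3 row clusters and 2 column clusters
out <- coclusterCategorical(as.matrix(dataset), nbcocluster = c(3, 2))
summary(out)
plot(out)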

This is the resulting plot:

I would like some help interpreting this plot from anyone who has worked with this package before: how do I know which original row or column each row or column of the co-clustered plot represents?

If I am not mistaken, the nbcocluster parameter sets the number of row clusters and column clusters. How do I know beforehand what number of clusters is appropriate?

Is it appropriate to do categorical clustering if one of the categories is essentially missing data?

I am open to suggestions for other methods that can bicluster categorical data. I appreciate any and all help, as I have never done this before.

What did you find in the documentation of the "blockcluster" package about this topic? — mkrieger1
Sadly, the documentation for the package is not very clear, at least to me (cran.r-project.org/web/packages/blockcluster/blockcluster.pdf). This might also stem from my inexperience with clustering methods. For the first question ("how do I know which row/column represents what in the co-clustered data"), the description of the plot function is very bare-bones: it has two arguments, x and y, where x is the output of the clustering and y is "Ignored" (I am not sure what the latter means). My second and third questions are more general clustering questions, I think. — Márton Oelbei

Márton Oelbei · Accepted Answer · 2020-07-27T20:36:17

For the first question, I figured out the answer (thanks to the forums at InriaForge).

The row and column assignments do not show up on the plot by default, but you can bind the classification results to your original data, e.g.:

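# Append each row's cluster label as an extra column ("dataset" is the data frame read in the question)
result_c <- cbind(dataset, rowclass = out@rowclass)
# Append each column's cluster label as an extra row; NA pads the new rowclass column
result <- rbind(result_c, colclass = c(out@colclass, NA))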
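Once the labels are attached, the original table can also be reordered so that rows and columns from the same cluster sit next to each other, which makes it easier to see which names fall into which block. The following is only a minimal base-R sketch: the `rowclass` and `colclass` vectors are made-up stand-ins for `out@rowclass` and `out@colclass`, not real output from blockcluster, and the matrix is typed in by hand from the example data.

```r
# Example data from the question, entered by hand
m <- matrix(c(0, 3, 3, 2, 3,
              0, 3, 3, 0, 0,
              0, 0, 3, 3, 3,
              0, 3, 1, 3, 0,
              0, 0, 3, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(LETTERS[1:5], c("I", "II", "III", "IV", "V")))

# HYPOTHETICAL labels, standing in for out@rowclass and out@colclass
rowclass <- c(0, 1, 0, 2, 1)
colclass <- c(0, 1, 1, 0, 1)

# Names of the rows / columns that belong to each cluster
members_by_row_cluster <- split(rownames(m), rowclass)
members_by_col_cluster <- split(colnames(m), colclass)

# Reorder the table so that members of the same cluster are adjacent
reordered <- m[order(rowclass), order(colclass), drop = FALSE]

print(members_by_row_cluster)
print(members_by_col_cluster)
print(reordered)
```

With real results, replacing the two hypothetical vectors with `out@rowclass` and `out@colclass` should give the same kind of listing, assuming the labels come back in the original row and column order.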

I did not find a solution for how to select the appropriate number of clusters, or for whether it is appropriate to cluster when one level is missing data.
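On the missing-data point, one thing that can be checked without any clustering package is how much of the table the "no data" level occupies, since rows or columns dominated by 0 may end up grouped mainly by their missingness. This is only a descriptive sketch in base R on the example data, using the level coding from the question; it does not settle whether clustering with a "no data" level is appropriate.

```r
# Example data from the question, entered by hand
m <- matrix(c(0, 3, 3, 2, 3,
              0, 3, 3, 0, 0,
              0, 0, 3, 3, 3,
              0, 3, 1, 3, 0,
              0, 0, 3, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(LETTERS[1:5], c("I", "II", "III", "IV", "V")))

# Overall frequency of each level (coding taken from the question)
level_counts <- table(factor(m, levels = 0:3,
                             labels = c("no data", "no increase", "mixed", "increase")))

# Share of "no data" cells in each row and each column
no_data_by_row <- rowMeans(m == 0)
no_data_by_col <- colMeans(m == 0)

print(level_counts)
print(no_data_by_row)
print(no_data_by_col)
```

For the example data, column I consists entirely of "no data", so it carries no information about increase versus no increase.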
