I have a table of categorical values that I would like to cluster by both rows and columns.
Example data: test_dataset.csv
I,II,III,IV,V
A,0,3,3,2,3
B,0,3,3,0,0
C,0,0,3,3,3
D,0,3,1,3,0
E,0,0,3,0,0
The levels are "no data", "no increase", "mixed", and "increase".
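To make the coding concrete, here is a minimal base R sketch that maps the numeric codes to their labels and counts how many cells per row and column are "no data". The data are inlined so it runs on its own; the variable names (`codes`, `level_names`, `labelled`, `codes_na`) are only illustrative, and recoding 0 as `NA` is an assumption about how the "no data" level could be handled, not something the package requires.

```r
# Same values as test_dataset.csv, inlined so the sketch is self-contained.
csv_text <- "I,II,III,IV,V
A,0,3,3,2,3
B,0,3,3,0,0
C,0,0,3,3,3
D,0,3,1,3,0
E,0,0,3,0,0"

# The header has one field fewer than the data rows, so read.table()
# uses the first column (A..E) as row names.
dataset <- read.table(text = csv_text, header = TRUE, sep = ",")
codes <- as.matrix(dataset)

# Map codes 0..3 to their labels (R vectors are 1-based, hence the +1).
level_names <- c("no data", "no increase", "mixed", "increase")
labelled <- matrix(level_names[codes + 1],
                   nrow = nrow(codes),
                   dimnames = dimnames(codes))

# Assumption for illustration: treat code 0 as missing rather than as a level.
codes_na <- codes
codes_na[codes_na == 0] <- NA

# How many cells are "no data" in each row and each column?
missing_per_row <- rowSums(is.na(codes_na))
missing_per_col <- colSums(is.na(codes_na))

print(labelled)
print(missing_per_row)
print(missing_per_col)
```

In this example every cell of column I is "no data", so that column carries no observed values at all.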
I found an R package, blockcluster, that in theory should be able to do this.
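# install.packages("blockcluster")
library(blockcluster)
# 0 = no data, 1 = no increase, 2 = mixed, 3 = increase
# The header row has one field fewer than the data rows, so read.table()
# takes the first column (A-E) as row names.
dataset <- read.table("test_dataset.csv", header = TRUE, sep = ",")
# Request 3 row clusters and 2 column clusters.
out <- coclusterCategorical(as.matrix(dataset), nbcocluster = c(3, 2))
summary(out)
plot(out)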
This is the resulting plot:
I would like some help interpreting this plot, if anyone has worked with this package before: how do I know which original row or column each row or column of the co-clustered data corresponds to?
If I am not mistaken, the nbcocluster parameter sets the number of row and column clusters. How do I know beforehand what the appropriate number of clusters is?
Is it appropriate to do categorical clustering if one of the categories is essentially missing data?
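On the first question, what I am after can be sketched in base R: given one cluster label per row and per column, reorder the matrix so that members of the same cluster sit next to each other, keeping the original names attached. The label vectors below (`row_cluster`, `col_cluster`) are made up purely for illustration and are not output from blockcluster. Where the fitted object stores its labels is an assumption on my part (possibly slots named `rowclass` and `colclass`); if `out` is an S4 object, `slotNames(out)` should list what is actually available.

```r
# Same values as test_dataset.csv, inlined so the sketch is self-contained.
csv_text <- "I,II,III,IV,V
A,0,3,3,2,3
B,0,3,3,0,0
C,0,0,3,3,3
D,0,3,1,3,0
E,0,0,3,0,0"
codes <- as.matrix(read.table(text = csv_text, header = TRUE, sep = ","))

# HYPOTHETICAL cluster labels, invented for this illustration only.
# With a real fit they would come from the clustering result instead.
row_cluster <- c(A = 1, B = 2, C = 1, D = 3, E = 2)
col_cluster <- c(I = 1, II = 2, III = 2, IV = 1, V = 1)

# Sort rows and columns by cluster label; ties keep their original order.
row_order <- order(row_cluster)
col_order <- order(col_cluster)
reordered <- codes[row_order, col_order]

# The dimnames travel with the data, so each position in the reordered
# matrix can be traced back to its original row and column.
print(reordered)

# Cluster membership as lists of original names.
members_by_row_cluster <- split(rownames(codes), row_cluster)
members_by_col_cluster <- split(colnames(codes), col_cluster)
print(members_by_row_cluster)
print(members_by_col_cluster)
```

If the plot draws the co-clustered matrix in this kind of sorted order, then `rownames(reordered)` and `colnames(reordered)` would give the mapping from plot position back to the original labels.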
I am open to suggestions for other methods that can bicluster categorical data. I appreciate any and all help; I have never done this before.
The plot function is very barebones: it has two arguments, x and y, where x is the output of the clustering and y is Ignored (I am not sure what the latter means). My second and third questions are more general clustering questions, I think. – Márton Oelbei