I have a dataset with two different clustering solutions (one run externally, one done myself). I want to compare them using the tanglegram and entanglement functions in the dendextend package, but I keep getting errors about labels and I cannot figure out why. To illustrate, I've cooked up a simple example using mtcars:
df1 <- mtcars
df1$ID <- row.names(mtcars)
clusts <- 1:3
# simulate two different cluster algorithms as columns containing cluster group
df1$cl1 <- sample(clusts, nrow(df1), replace = TRUE)
df1$cl2 <- sample(clusts, nrow(df1), replace = TRUE)
table(df1$cl1, df1$cl2)
# Make a copy
df2 <- df1
# Use data.tree to convert df's to data.trees
library(data.tree)
df1$pathString <- paste("Tree1", df1$cl1, df1$ID, sep = "/")
df2$pathString <- paste("Tree2", df2$cl2, df2$ID, sep = "/")
node1 <- as.Node(df1)
node2 <- as.Node(df2)
# Convert to dendrograms and compare using dendextend
library(dendextend)
dend1 <- as.dendrogram(node1)
dend2 <- as.dendrogram(node2)
tanglegram(dend1, dend2)
entanglement(dend1, dend2)
This gives these errors:
> tanglegram(dend1, dend2)
Error in dend12[[1]] : subscript out of bounds
In addition: Warning message:
In intersect_trees(dend1, dend2, warn = TRUE) :
The two trees had no common labels!
> entanglement(dend1, dend2)
Error in match_order_by_labels(dend2, dend1) :
labels do not match in both trees. Please make sure to fix the labels names!
(make sure also that the labels of BOTH trees are 'character')
I do not understand why these errors are occurring and examining the data structures is not giving me the answer! Any helpful enlightenment would be much appreciated!
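Since both error messages point at the labels, one way to narrow this down is to compare the two label sets directly. The following is only a minimal sketch: compare_labels is a hypothetical helper (it is not part of dendextend), and the demo uses two ordinary hclust dendrograms from mtcars rather than the data.tree ones above.

```r
# Hypothetical helper (not a dendextend function): summarise how the
# leaf labels of two dendrograms overlap.
compare_labels <- function(d1, d2) {
  l1 <- labels(d1)
  l2 <- labels(d2)
  list(
    both_character = is.character(l1) && is.character(l2),
    n_labels       = c(first = length(l1), second = length(l2)),
    n_common       = length(intersect(l1, l2)),
    only_in_first  = setdiff(l1, l2),
    only_in_second = setdiff(l2, l1)
  )
}

# Demo on two hclust-derived dendrograms built from the same data
d_a <- as.dendrogram(hclust(dist(mtcars)))
d_b <- as.dendrogram(hclust(dist(mtcars), method = "average"))
res <- compare_labels(d_a, d_b)
str(res)
```

Running the same helper on dend1 and dend2 should show whether labels() returns what I expect for the data.tree-derived trees.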
EDIT: Taking note of @emilliman5's answer below: I understand that my dendrograms are unresolved. I'm not using hierarchical clustering, so I want to compare unresolved dendrograms. Further, I've adapted some code from this question: How do I manually create a dendrogram (or "hclust") object? (in R) to build the dendrograms myself, and these do produce a tanglegram despite being unresolved. However, this is not a solution, as it's too hard to generalise to varying parameters (my tree depth/resolution varies, and trying to write a function to code trees with varying levels of nesting is a road to insanity!).
tree1 <- list()
attributes(tree1) <- list(members=nrow(df1), height=3)
class(tree1) <- "dendrogram"
# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for (i in 1:length(clusts)) {
  leaves[[i]] <- which(df1$cl1 == (i))
}
for (i in 1:length(clusts)) {
  tree1[[i]] <- list()
  attributes(tree1[[i]]) <- list(members = length(which(df1$cl1 == i)), height = 2, edgetext = i)
  for (j in 1:length(leaves[[i]])) {
    tree1[[i]][[j]] <- list()
    tree1[[i]][[j]] <- leaves[[i]]
    attributes(tree1[[i]][[j]]) <- list(members = 1, height = 1,
                                        label = as.character(leaves[[i]][j]),
                                        leaf = TRUE)
  }
}
plot(tree1, center=TRUE)
tree2 <- list()
attributes(tree2) <- list(members=nrow(df2), height=3)
class(tree2) <- "dendrogram"
# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for (i in 1:length(clusts)) {
  leaves[[i]] <- which(df2$cl2 == (i))
}
for (i in 1:length(clusts)) {
  tree2[[i]] <- list()
  attributes(tree2[[i]]) <- list(members = length(which(df2$cl2 == i)), height = 2, edgetext = i)
  for (j in 1:length(leaves[[i]])) {
    tree2[[i]][[j]] <- list()
    tree2[[i]][[j]] <- leaves[[i]]
    attributes(tree2[[i]][[j]]) <- list(members = 1, height = 1,
                                        label = as.character(leaves[[i]][j]),
                                        leaf = TRUE)
  }
}
plot(tree2, center=TRUE)
tanglegram(tree1, tree2)
It's ugly, but it's all I want/need.
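For reference, the kind of generalisation I have in mind would look roughly like the sketch below. It is only a sketch under assumptions: build_node and build_dend are hypothetical names, the grouping columns are assumed to be supplied as a list ordered from the outermost level inwards, each leaf stores a single integer row index (as hclust-derived leaves do), and I have not checked it against tanglegram.

```r
# Hypothetical helpers: build an unresolved dendrogram from nested groupings.
# idx    : integer row indices covered by this node
# labels : character vector of leaf labels, indexed by row
# groups : list of grouping vectors, outermost level first
# height : height assigned to this node (leaves sit at height 0)
build_node <- function(idx, labels, groups, height) {
  if (length(groups) == 0) {
    # No grouping levels left: every index becomes a leaf of this node
    children <- lapply(idx, function(i) {
      structure(i, members = 1L, height = 0, label = labels[i], leaf = TRUE)
    })
  } else {
    # Split the indices by the current grouping level and recurse
    parts <- split(idx, groups[[1]][idx], drop = TRUE)
    children <- lapply(unname(parts), function(p) {
      build_node(p, labels, groups[-1], height - 1)
    })
  }
  structure(children, members = length(idx), height = height)
}

build_dend <- function(labels, groups) {
  groups <- as.list(groups)
  root <- build_node(seq_along(labels), labels, groups,
                     height = length(groups) + 1)
  class(root) <- "dendrogram"
  root
}

# Demo: one grouping level, then two nested levels
set.seed(1)
ids   <- rownames(mtcars)
cl    <- sample(1:3, length(ids), replace = TRUE)
dend  <- build_dend(ids, list(cl))
dend2 <- build_dend(ids, list(cl, mtcars$cyl))
```

The depth of the tree follows from the number of grouping vectors passed in, so the same call should cover trees of varying resolution.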
Trying to figure out why this works, if I peek into the dendrograms:
> str(unclass(tree1[[1]][[1]]))
atomic [1:12] 1 8 9 10 11 13 16 22 25 27 ...
- attr(*, "members")= num 1
- attr(*, "height")= num 1
- attr(*, "label")= chr "1"
- attr(*, "leaf")= logi TRUE
Notice that the leaf holds a vector. Peeking into an hclust-derived dendrogram, we see there is also a vector/atomic:
> str(unclass(as.dendrogram(hclust(dist(df1))))[[1]][[1]])
atomic [1:1] 31
- attr(*, "members")= int 1
- attr(*, "height")= num 0
- attr(*, "label")= chr "Maserati Bora"
- attr(*, "leaf")= logi TRUE
However, peeking into the dendrogram created via data.tree, I note there is no vector/atomic:
> str(unclass(dend1[[1]][[1]]))
list()
- attr(*, "label")= chr "Mazda RX4"
- attr(*, "members")= num 1
- attr(*, "height")= num 0
- attr(*, "leaf")= logi TRUE
Could this missing atomic be causing the problem?
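To look at every leaf at once rather than one at a time, something like the following could be used. This is a minimal sketch: leaf_summary is a hypothetical helper, and it only reports what each leaf holds; it does not establish whether an empty leaf is actually what breaks tanglegram.

```r
# Hypothetical helper: walk a dendrogram and report, for each leaf,
# its label plus the type and length of the value stored in the leaf.
leaf_summary <- function(d) {
  if (isTRUE(attr(d, "leaf"))) {
    return(data.frame(label  = as.character(attr(d, "label")),
                      type   = typeof(unclass(d)),
                      length = length(d),
                      stringsAsFactors = FALSE))
  }
  # Internal node: recurse into the children and stack the results
  do.call(rbind, lapply(unclass(d), leaf_summary))
}

# Demo on an hclust-derived dendrogram
hc_dend <- as.dendrogram(hclust(dist(mtcars)))
smry <- leaf_summary(hc_dend)
head(smry)
```

Applied to dend1, I would expect the type column to read "list" with length 0 for every leaf, matching the str() output above.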
plot(dend1) and make sure your trees are resolved. – emilliman5
plot(dend1, center=TRUE, horiz=TRUE) and I don't see any obvious problem. They both have the same number of leaves with the same labels, although of course the order is different (by design). – user2498193