I have a dataset with two clustering solutions (one run externally, one done myself). I want to compare them using the tanglegram and entanglement functions in the dendextend package, but I keep getting errors about labels and cannot figure out why. To illustrate, I've put together a simple example using mtcars:

df1 <- mtcars
df1$ID <- row.names(mtcars)
clusts <- 1:3

# simulate two different cluster algorithms as columns containing cluster group
df1$cl1 <- sample(clusts, nrow(df1), replace = TRUE)
df1$cl2 <- sample(clusts, nrow(df1), replace = TRUE)
table(df1$cl1, df1$cl2)

# Make a copy
df2 <- df1

# Use data.tree to convert df's to data.trees
library(data.tree)
df1$pathString <- paste("Tree1", df1$cl1, df1$ID, sep = "/")
df2$pathString <- paste("Tree2", df2$cl2, df2$ID, sep = "/")

node1 <- as.Node(df1)
node2 <- as.Node(df2)

# Convert to dendrograms and compare using dendextend
library(dendextend)
dend1 <- as.dendrogram(node1)
dend2 <- as.dendrogram(node2)

tanglegram(dend1, dend2)
entanglement(dend1, dend2)

This gives these errors:

> tanglegram(dend1, dend2)
Error in dend12[[1]] : subscript out of bounds
In addition: Warning message:
In intersect_trees(dend1, dend2, warn = TRUE) :
  The two trees had no common labels!
> entanglement(dend1, dend2)
Error in match_order_by_labels(dend2, dend1) : 
  labels do not match in both trees.  Please make sure to fix the labels    names!
(make sure also that the labels of BOTH trees are 'character')

I do not understand why these errors are occurring and examining the data structures is not giving me the answer! Any helpful enlightenment would be much appreciated!
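Since both errors complain about labels, a first diagnostic is to pull the leaf labels out of each dendrogram and compare them directly. The following is a minimal sketch using only base R; `dend_a` and `dend_b` are stand-in dendrograms built with hclust (not the data.tree ones above), so substitute your own objects. It only inspects the labels and does not by itself explain why a particular conversion fails.

```r
# Stand-in dendrograms (assumption: replace these with your own dend1/dend2).
dend_a <- as.dendrogram(hclust(dist(mtcars), method = "complete"))
dend_b <- as.dendrogram(hclust(dist(mtcars), method = "average"))

# labels() returns the leaf labels in plotting order.
lab_a <- labels(dend_a)
lab_b <- labels(dend_b)

# Both label sets should be character vectors.
both_character <- is.character(lab_a) && is.character(lab_b)

# Labels shared by both trees, and labels found in only one of them.
shared <- intersect(lab_a, lab_b)
only_a <- setdiff(lab_a, lab_b)
only_b <- setdiff(lab_b, lab_a)

print(both_character)
print(length(shared))
print(only_a)
print(only_b)
```

If `shared` is empty, or either label set is not a character vector of the expected length, the label extraction itself is where the two trees diverge.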

EDIT: Taking note of @emilliman5's answer below: I understand that my dendrograms are unresolved. I'm not using hierarchical clustering, so I want to compare unresolved dendrograms. Furthermore, I've adapted some code from this question: How do I manually create a dendrogram (or "hclust") object ? (in R) to build the dendrograms myself, and these will produce a tanglegram despite being unresolved. However, this is not a solution, as it is too hard to generalise to varying parameters (my tree depth/resolution varies, and trying to write a function to code trees with varying levels of nesting is a road to insanity!).

tree1 <- list()
attributes(tree1) <- list(members=nrow(df1), height=3)
class(tree1) <- "dendrogram"

# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
    leaves[[i]] <- which(df1$cl1 == (i) )
}
for(i in 1:length(clusts)){
    tree1[[i]] <- list()
    attributes(tree1[[i]]) <- list(members=length(which(df1$cl1==i)), height=2, edgetext=i)
    for( j in 1:length(leaves[[i]]) ){
        tree1[[i]][[j]] <- list()
        tree1[[i]][[j]] <- leaves[[i]]
        attributes(tree1[[i]][[j]]) <- list(members = 1, height = 1,
                                       label = as.character(leaves[[i]][j]),
                                       leaf = TRUE)
    }
}
plot(tree1, center=TRUE)

tree2 <- list()
attributes(tree2) <- list(members=nrow(df2), height=3)
class(tree2) <- "dendrogram"

# Assign leaf names to list
leaves <- list()
leaf_height_list <- list()
for(i in 1:length(clusts)){
    leaves[[i]] <- which(df2$cl2 == (i) )
}
for(i in 1:length(clusts)){
    tree2[[i]] <- list()
    attributes(tree2[[i]]) <- list(members=length(which(df2$cl2==i)), height=2, edgetext=i)
    for( j in 1:length(leaves[[i]]) ){
        tree2[[i]][[j]] <- list()
        tree2[[i]][[j]] <- leaves[[i]]
        attributes(tree2[[i]][[j]]) <- list(members = 1, height = 1,
                                        label = as.character(leaves[[i]][j]),
                                        leaf = TRUE)
    }
}
plot(tree2, center=TRUE)

tanglegram(tree1, tree2)

Ugly tanglegram

It's ugly, but it's all I want/need.

Trying to figure out why this works, if I peek into the dendrograms:

> str(unclass(tree1[[1]][[1]]))
 atomic [1:12] 1 8 9 10 11 13 16 22 25 27 ...
 - attr(*, "members")= num 1
 - attr(*, "height")= num 1
 - attr(*, "label")= chr "1"
 - attr(*, "leaf")= logi TRUE

Notice that the leaf holds a vector. Peeking into an hclust-derived dendrogram, we see there is also a vector/atomic:

> str(unclass(as.dendrogram(hclust(dist(df1))))[[1]][[1]])
 atomic [1:1] 31
 - attr(*, "members")= int 1
 - attr(*, "height")= num 0
 - attr(*, "label")= chr "Maserati Bora"
 - attr(*, "leaf")= logi TRUE

However, peeking into the data.tree-created dendrogram, I note there is no vector/atomic:

> str(unclass(dend1[[1]][[1]]))
 list()
 - attr(*, "label")= chr "Mazda RX4"
 - attr(*, "members")= num 1
 - attr(*, "height")= num 0
 - attr(*, "leaf")= logi TRUE

Could this missing atomic be causing a problem?

It looks like you have unresolved trees, which would prevent the tanglegram and entanglement. Have a look at the dendrograms with plot(dend1) and make sure your trees are resolved. - emilliman5
Thanks for your answer. One question: I'm not sure what you mean by "resolved" in this context. Anyhow, I have plotted them both using plot(dend1, center=TRUE, horiz=TRUE) and I don't see any obvious problem. They both have the same number of leaves with the same labels, although of course the order is different (by design). - user2498193
I think you are right, the error has to do with data.tree's representation of a dendrogram, not the resolution of the tree. After some more thought about your broader goals, a simple confusion matrix might be a better representation of the clustering differences. You could even turn it into a heatmap to increase its aesthetic value. Good luck! - emilliman5
Yes, I think I will look into other options also. - user2498193
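The confusion-matrix idea from the comments can be sketched in base R as follows. This is only an illustration under assumptions: the two cluster assignments are simulated as in the question, the seed is arbitrary and there only for reproducibility, and the names `solution1` and `solution2` are hypothetical labels for the two clustering solutions.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

clusts <- 1:3
# Simulated cluster assignments, as in the question's mtcars example.
# factor(..., levels = clusts) keeps the table 3 x 3 even if a cluster is empty.
cl1 <- factor(sample(clusts, nrow(mtcars), replace = TRUE), levels = clusts)
cl2 <- factor(sample(clusts, nrow(mtcars), replace = TRUE), levels = clusts)

# Cross-tabulate the two solutions: rows are solution 1, columns are solution 2.
conf_mat <- table(solution1 = cl1, solution2 = cl2)
print(conf_mat)

# Draw the counts as a heatmap, without reordering rows/columns or rescaling.
heatmap(unclass(conf_mat), Rowv = NA, Colv = NA, scale = "none",
        xlab = "solution2", ylab = "solution1")
```

Each cell counts the observations placed in a given pair of clusters, so two solutions that agree up to relabelling would show one dominant cell per row and column.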

1 Answer

The problem is that your tree is not dichotomous; that is, at each node you have more than two branches you could traverse. In hierarchical clustering, each node should have only two branches. See the two examples below:

This is the tree from your example


This is what a resolved tree should look like

plot(hclust(dist(df1[, 1:11])))

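To check whether a given dendrogram is dichotomous in the sense described above, one option is to walk the tree and confirm that every internal node has exactly two children. The sketch below uses only base R; `is_binary_dend` and `make_leaf` are hypothetical helper names introduced for illustration, not part of dendextend or data.tree.

```r
# TRUE if every internal node of a dendrogram has exactly two children.
is_binary_dend <- function(node) {
  if (is.leaf(node)) return(TRUE)
  length(node) == 2 &&
    all(vapply(seq_along(node),
               function(i) is_binary_dend(node[[i]]),
               logical(1)))
}

# A resolved (binary) tree from hierarchical clustering.
resolved <- as.dendrogram(hclust(dist(mtcars[, 1:11])))

# A hand-built unresolved tree: one root with three leaves.
make_leaf <- function(label, index) {
  structure(index, members = 1L, height = 0, label = label, leaf = TRUE)
}
unresolved <- structure(
  list(make_leaf("a", 1L), make_leaf("b", 2L), make_leaf("c", 3L)),
  members = 3L, height = 1, class = "dendrogram"
)

print(is_binary_dend(resolved))
print(is_binary_dend(unresolved))
```

The hclust-derived tree passes the check, while the three-way root does not.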