1
votes

When performing hierarchical clustering in R with the hclust function, how do you know the height of the final merge?

To clarify, using one of R's built-in datasets:

hc <- hclust(dist(USArrests))
dendrogram1 <- as.dendrogram(hc)
plot(hc)

This results in a variable hc containing all the clustering information.

R clustering output

And the dendrogram:

R dendrogram

As you can see on the dendrogram, the final merge happens at a height above 200 (about 300). But how does the dendrogram know? As far as I can see, this information is in neither hc$height nor the dendrogram1 variable; the highest merge listed there is at 169.

variable dendrogram1

If the dendrogram1 variable does not contain this information, how does the plot function know the merge must occur at a height of 300?

dendrogram R top merge

I am asking because I need this number (approximately 300) for other applications, and reading it off the plot is downright impractical.

Thanks in advance to anyone willing to help!

2
Can you also paste your code so that people can run it themselves? To share your data you can use dput(); if the table is too long, wrap it in head() first, e.g. dput(head(your_data)). - Sander Van der Zeeuw
Hi Sander, the top 3 lines were actually the full code to produce all the screenshots, and the data should be in your R already. - Sleenee
Ahh, never mind then ;). - Sander Van der Zeeuw

2 Answers

6
votes

These values can be calculated with stats::cophenetic():

The cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster.

This yields the following for your example:

sort(unique(cophenetic(hc)))
#  [1]   2.291   3.834   3.929   6.237   6.638   7.355   8.027   8.538  10.860
# [10]  11.456  12.425  12.614  12.775  13.045  13.297  13.349  13.896  14.501
# [19]  15.408  15.454  15.630  15.890  16.977  18.265  19.438  19.904  21.167
# [28]  22.366  22.767  24.894  25.093  28.635  29.251  31.477  31.620  32.719
# [37]  36.735  36.848  38.528  41.488  48.725  53.593  57.271  64.994  68.762
# [46]  87.326 102.862 168.611 293.623
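
Since the question only asks for the final merge, the largest cophenetic distance should be enough. The sketch below is not part of the original answer; it reuses the hc object from the question and, as a cross-check, compares the result with the height component that hclust objects carry (one entry per merge).

```r
# Minimal sketch using the same built-in USArrests data as the question.
hc <- hclust(dist(USArrests))

# The largest cophenetic distance is the height at which the
# last two clusters are joined, i.e. the final merge.
final_merge <- max(cophenetic(hc))

# Cross-check: the hclust object stores n - 1 merge heights
# (49 for the 50 states) in its `height` component.
length(hc$height)
max(hc$height)

# Should match the last value in the sorted list above.
round(final_merge, 3)
```

If both numbers agree, either expression can be used wherever the top merge height is needed.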
3
votes

@rcs's answer is correct.

Another way to solve it is with the get_nodes_attr function from the dendextend package:

# install.packages("dendextend")
library(dendextend)

dend <- as.dendrogram(hclust(dist(USArrests[1:5,])))
# Equivalent, written with the pipe operator:
# dend <- USArrests[1:5,] %>% dist %>% hclust %>% as.dendrogram

# The height for all nodes:
get_nodes_attr(dend, "height")

And we can easily see the height for each node:

> get_nodes_attr(dend, "height")
[1] 108.85192   0.00000  63.00833  23.19418   0.00000   0.00000  37.17701   0.00000   0.00000
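
If only the top merge is needed, the root node's height can also be read with base R alone. This is a minimal sketch on the same five-row subset, not something taken from the dendextend documentation; it relies on each dendrogram node carrying a "height" attribute.

```r
# Same five-row subset as above, built with base R only.
hc5 <- hclust(dist(USArrests[1:5, ]))
dend5 <- as.dendrogram(hc5)

# The top-level dendrogram object is the root node, so its
# "height" attribute is the height of the final merge.
root_height <- attr(dend5, "height")

# Should match the first value returned by get_nodes_attr() above.
root_height
```

The same call on the full dendrogram from the question, attr(dendrogram1, "height"), should give the value of roughly 300 seen in the plot.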

For more details on the package, you can have a look at its vignette.