I am clustering a distance matrix computed from a 20,000-row x 169-column data set in R using hclust(). When I convert the cluster object to a dendrogram and plot it in full, the result is too large to read, even when I output it to a fairly large PDF.
library(vegan)  # provides vegdist()
df <- as.data.frame(matrix(abs(rnorm(3380000)), nrow = 20000))
mydist <- vegdist(df)
my.hc <- hclust(mydist, method = "average")
hcd <- as.dendrogram(my.hc)
pdf("hclust_plot.pdf", width = 40, height = 15)
plot(hcd)
dev.off()
I would like to specify the number of clusters (k) at which to truncate the dendrogram, then plot only the upper portion of the dendrogram above the k split points. I know I can plot the upper portion for a given height (h) using the function cut().
pdf("hclust_plot2.pdf", width = 40, height = 15)
plot(cut(hcd, h = 0.99)$upper)
dev.off()
I also know I can use the dendextend package to color the dendrogram plot with the k groups.
library(dendextend)
pdf("hclust_plot3.pdf", width = 40, height = 15)
plot(color_branches(hcd, k = 44))
dev.off()
But for my data set, this dendrogram is too dense to tell which group has which color. Is there a way to plot only the upper portion of the dendrogram above the cut point by specifying k, not h? Or is there a way to get the h value for a dendrogram, given k?
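To make the second option concrete, here is a rough sketch of what I have in mind. It is not a confirmed solution. It assumes the merge heights stored in the hclust object increase monotonically (which should hold for average linkage) and that there are no ties at the cut point. The helper name height_for_k is made up for illustration, and the small data set with dist() is only a stand-in for my real data and vegdist().

```r
# Small stand-in data set (hypothetical; the real one is 20,000 x 169)
set.seed(1)
small <- matrix(abs(rnorm(200 * 5)), nrow = 200)
hc <- hclust(dist(small), method = "average")

# Hypothetical helper: pick a height that cuts an hclust tree into k clusters.
# Assumes monotone merge heights and no tie between the two heights involved.
height_for_k <- function(hc, k) {
  n <- length(hc$height) + 1            # number of observations
  stopifnot(k >= 2, k <= n - 1)
  hs <- sort(hc$height, decreasing = TRUE)
  # k clusters remain when exactly k - 1 merges lie above the cut,
  # so take the midpoint between the (k-1)th and kth largest heights
  mean(hs[c(k - 1, k)])
}

k <- 10
h <- height_for_k(hc, k)

# Plot only the part of the tree above the cut
upper <- cut(as.dendrogram(hc), h = h)$upper
plot(upper)
```

If this reasoning is right, cutree(hc, h = h) should return exactly k groups, which would be a quick way to check the height before plotting.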