How to color dendrogram labels using R based on label name not grouping

Question

I am trying to color the labels of a dendrogram based on a portion of the label name. The labels are derived from the names of .txt files in a folder, which follow this pattern: 167_001.txt, where the first three digits identify the author of a text and the last three digits distinguish the separate pieces of writing by that author. I want to label each leaf with the full file name, but color the label based only on the first three digits, so I can see which works by one author have more in common with a different author and work out who influenced whom. (These are medieval authors, so you won't be helping me find any modern authors who might have plagiarized something.) So if a file begins with 080, I want all the 080 files to be one color no matter what the final part of the file name is and no matter where it is grouped, but I still want the final part of the file name to appear in the label. Here is what I have so far:

# Load the package and the data
library(dendextend)
data(USArrests)
dd <- dist(scale(USArrests), method = "euclidean")

# Perform a cluster analysis on the distance object
hc <- hclust(dd)
# Get the text file names to use as labels


dend <- as.dendrogram(hc)

# Give each unique label its own color, then color the branches of 5 clusters
dend2 <- color_unique_labels(dend)
d5gr <- color_branches(dend2, k = 5, groupLabels = TRUE)
# plot(d5gr)
plot(d5gr, horiz = TRUE)

As you can see, I am using the dendextend package. If anyone knows a better package, or one that will accomplish what I need just as well, that would be great. What I currently have puts the files in the same color family: since the names are similar enough, the "color_unique_labels" function offered by dendextend at least sort of colors them in shades of the same color. It does not, however, make them exactly the same color, which is what I want, so that the same author is always the same color and it is easier to see which works share similarities with different authors. See below. There are a few hundred different authors, so I would prefer not to assign each one a color manually (A = "red", B = "blue", C = "orchid", etc.), but would rather have something that works like "color_unique_labels" and automatically chooses and assigns a color based on the first three digits of the file name. My example uses the USArrests dataset, and I would like to see how to color the state names by their first letter, so that all the "A" states, "C" states, and so on are the same color. So Alabama, Alaska, Arizona, and Arkansas should all be the same color, and California, Colorado, and Connecticut should also share a color. Again, I would prefer an automated approach, as my real dataset has a few hundred possibilities and not just 50; however, I am not opposed to a manual one if that is the only option. Thanks in advance!

dendextend dendrogram using "color_unique_labels" function

Please add a simple reproducible example for people to work with. I also don't understand the criteria to color the label based on the author, but not assign each author an individual color. Those seem to be at odds. — gung - Reinstate Monica
Trying to find a simple reproducible example. The problem is many of the suggestions I have tried previously do not work with my dataset even though it appears to be a similar problem. Since the dendrogram is comparing the text of each file and grouping them based on textual similarities, but labeling them based on the filename instead of the text contained in the file, I keep getting various errors. As far as the confusion with my wording, I edited my question to clarify, but basically, I do not want to manually assign each author a color, but would prefer an automated process, if that helps. — DHranger
Often, I find that in the process of figuring out how to create a reproducible example to ask people about, I discover the solution myself. You won't need most of the code you show for a simple example. Do you know how to extract the numbers you want from the file names, eg? If so, that can be skipped. Also, if the only issue w/ colors for authors is if you don't want to manually assign them, that's trivial. — gung - Reinstate Monica
DHranger I don't see why just using labels_colors with labels won't do your trick. As @gung wrote, please provide a SIMPLE reproducible example and we'll gladly try to demonstrate this in an answer. — Tal Galili
Okay, this is as simple and reproducible as I could get. Again, I need the labels colored based on the first few characters of the file name. In this sample dataset, how do I color all the "A" states "red", the "C" states "green", and the "M" states "purple", but still have the label show the entire state name? How do I do this without having to write out the entire name or names and set them equal to a color, as in ("Alaska", "Alabama", "Arizona", "Arkansas")="red"? That would take much longer than I hope to spend, as there are hundreds of authors and thousands of individual files (each author has multiple works). — DHranger

Milan Milan · Accepted Answer · 2017-07-06T15:20:57

Hope your question was answered by now. In case this is still useful to you, here's my stab at solving it:

First, create a new variable that groups the authors into their categories (you said something about the beginning of a categorical variable you already had). Depending on the number of categories you want to create, you'll need different code; check the Quick R Recoding variables section and this tutorial on recode() for what might work for your particular case.

If this proves difficult in R, maybe try generating the group variable in Excel: it has a good filtering function that can help you quickly fill in the reference code. For future dataset/dataframe management, I can recommend splitting the data into as many variables as you actually have. If I understand your problem correctly, part of the issue comes from having two categorical variables in one (group + author = filename).
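For file names like the ones in the question, the group variable can also be derived directly from the label text. The following is only a minimal sketch: the file names are made up for illustration, and base R's rainbow() stands in for whatever palette you prefer.

```r
# Hypothetical file names following the pattern described in the question
files <- c("080_001.txt", "080_002.txt", "167_001.txt",
           "167_002.txt", "203_001.txt")

# The author code is the first three characters of each file name
author <- substr(files, 1, 3)

# A factor stores one integer code per author, which can index a palette
GROUP <- factor(author)

# One automatically chosen color per author (base R palette, an assumption)
cols <- rainbow(nlevels(GROUP))

# Look up the color of each file's author; the full file name stays the label
col_GROUP <- cols[GROUP]
names(col_GROUP) <- files
```

Every file that shares an author code gets the same color, and nothing has to be typed out per author. With a few hundred authors, neighbouring colors will be hard to tell apart whichever palette is used.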

Once you've got your group variable ("GROUP") you'll need to assign a color set to it:

library(dendextend)
library(colorspace)

# Make the GROUP color palette:

GROUP <- factor(dataframe$GROUP) # convert the group variable to a factor
n_GROUP <- nlevels(GROUP) # count the number of unique groups
cols <- rainbow_hcl(n_GROUP) # select one color per group
col_GROUP <- cols[GROUP] # assign each observation the color of its group

So here is where the 'color labels by category' trick actually takes place: after you've made the dendrogram but before plotting the dendrogram you sort your colors according to the dendrogram (dend):

dend <- as.dendrogram(hc) # convert hc into a dendrogram

# Sort the GROUP colors into the order of the leaves in dend:
col_GROUP <- col_GROUP[order.dendrogram(dend)]

# Set the label colors, then plot as you normally would; I did this:
dend <- dend %>%
  set("labels_colors", col_GROUP) # change label colors to GROUP
plot(dend, main = "Dendrogram with labels colored according to GROUP")
legend("topleft", legend = levels(GROUP), fill = cols, cex = 0.5)

This should color your labels according to their group category. A similar thing can be done if you want to change the label names too (i.e. replace the unique names needed for the analysis with the broader category names). You just sort the GROUP factor according to dend and set the labels to GROUP while plotting (see dendextend::set in the R help for plotting options).
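Applied to the USArrests example from the question, the same idea might look like the sketch below. It assumes dendextend is installed and uses base R's rainbow() rather than rainbow_hcl(). Because the labels are read from the dendrogram itself, they are already in leaf order, so this variant needs no separate order.dendrogram() step.

```r
library(dendextend)

# Cluster the scaled USArrests data, as in the question
hc <- hclust(dist(scale(USArrests), method = "euclidean"))
dend <- as.dendrogram(hc)

# Labels in the order the leaves appear in the dendrogram
lab <- labels(dend)

# Group each state by the first letter of its name
grp <- factor(substr(lab, 1, 1))

# One automatically chosen color per first letter
cols <- rainbow(nlevels(grp))

# Assign the colors leaf by leaf; the full state name remains the label
labels_colors(dend) <- cols[grp]

plot(dend, horiz = TRUE, main = "State labels colored by first letter")
```

For the real data, substr(lab, 1, 3) would take the three-digit author code instead of the first letter.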

Hope this helps, Cheers!
