I am trying to color the labels of a dendrogram based on part of each label. The labels come from the names of .txt files in a folder, which follow the pattern 167_001.txt: the first three digits identify the author of a text, and the last three digits distinguish the separate pieces of writing by that author. I want to label the leaves with the full file name but color each label by the first three digits only, so I can see which works by one author have more in common with a different author and, from that, who influenced whom. (These are medieval authors, so you won't be helping me catch any modern plagiarists.) For example, every file beginning with 080 should be the same color no matter what the rest of the file name is and no matter where it ends up in the tree, while the label still shows the full file name. Here is what I have so far:
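# Load dendextend and the example data
library(dendextend)
data(USArrests)
dd <- dist(scale(USArrests), method = "euclidean")
# Perform a cluster analysis on the distance object
hc <- hclust(dd)
# Convert to a dendrogram; the row names (file names in my real data) become the labels
dend <- as.dendrogram(hc)
# Give each unique label its own color
dend2 <- color_unique_labels(dend)
# Color the branches by cutting the tree into 5 clusters
d5gr <- color_branches(dend2, k = 5, groupLabels = TRUE)
# plot(d5gr)
plot(d5gr, horiz = TRUE)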
As you can see, I am using the dendextend package. If another package would accomplish this just as well or better, I am happy to use it. What I have now puts the files in the same color family: because the names are similar, dendextend's color_unique_labels function at least sort of colors them in shades of the same color. It does not make them exactly the same color, which is what I want, so that the same author is always the same color and it is easier to see which works share similarities with different authors. See the plot below. There are a few hundred different authors, so I would prefer not to assign each one a color manually (A = "red", B = "blue", C = "orchid", etc.); I would rather have something that works like color_unique_labels and automatically chooses and assigns a color based on the first three digits of the file name. My example uses the USArrests dataset, where the equivalent would be coloring the state names by their first letter, so that Alabama, Alaska, Arizona, and Arkansas are all one color and California, Colorado, and Connecticut are all another. Again, I would prefer an automated approach, since my real dataset has a few hundred possibilities rather than just 50, but I am not opposed to manual assignment if that is the only option. Thanks in advance!
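To make the goal concrete, here is a minimal sketch of the mapping I have in mind, not a tested solution for my real data. It assumes that dendextend's `labels_colors<-` assigns colors in the same order as `labels(dend)`, and it uses `rainbow()` purely as a stand-in palette. The names `key`, `group_cols`, and `label_cols` are made up for illustration.

```r
library(dendextend)  # assumed installed; provides `labels_colors<-`

data(USArrests)
hc   <- hclust(dist(scale(USArrests), method = "euclidean"))
dend <- as.dendrogram(hc)

# Labels in the order they appear in the dendrogram (not the data order)
labs <- labels(dend)

# Grouping key: the first letter here; for file names such as "167_001"
# this would be substr(labs, 1, 3) instead
key <- substr(labs, 1, 1)

# One automatically chosen color per distinct key
groups     <- sort(unique(key))
group_cols <- setNames(rainbow(length(groups)), groups)

# Look up each label's color by its key, keeping the full label as the name
label_cols <- setNames(unname(group_cols[key]), labs)

# Assumption: colors are applied in the order of labels(dend)
labels_colors(dend) <- unname(label_cols)
plot(dend, horiz = TRUE)
```

With a few hundred keys, neighboring `rainbow()` colors would probably be hard to tell apart, so a different palette may be needed; the point is only that the color is looked up from the prefix rather than from the whole label.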