
I'm using the ggtree package from Bioconductor to plot two phylogenetic trees. It works essentially like ggplot2, and I want to modify the aesthetics of the tip labels to match classes set by an external CSV file.

I have a multiPhylo object that contains two different clusterings of the same 50 genes (we'll pretend there are only 6 for this example). When I evaluate multitree[[1]]$tip.label and multitree[[2]]$tip.label they both give me the same list in the same order, so I know that while the plots are displayed differently, the genes are still stored in the same order.

library(ggtree)
library(ape)

mat <- as.dist(matrix(data = rexp(36, rate = 10), nrow = 6, ncol = 6))  ### 6 x 6 = 36 random distances
nj.tree <- nj(mat)  ### Package ape
hclust.tree <- as.phylo(hclust(mat))
multitree <- c(nj.tree, hclust.tree)

I want to plot these trees and then annotate them with external data indicating which of 5 classes (A, B, C, D, and E) each gene belongs to according to existing literature.

write.csv(multitree[[1]]$tip.label, "Genes.csv")

I used this command to create a CSV file listing the genes in the right order (not sure if that's relevant). I then manually entered the corresponding class letter in the column adjacent to each gene. It looks something like this:

Gene    Class
1       A
2       A
3       D
4       C
5       B
6       E

And so on.
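For reference, a minimal sketch of building the same table in R instead of editing the CSV by hand. The `tip.labels` vector is a hypothetical stand-in for `multitree[[1]]$tip.label`, the class letters are the ones from the table above, and the temporary file path is only for illustration:

```r
# Stand-in for multitree[[1]]$tip.label (assumption: labels are "1".."6")
tip.labels <- as.character(1:6)

# One row per gene, in tip-label order, with its class from the literature
genes <- data.frame(
  Gene  = tip.labels,
  Class = c("A", "A", "D", "C", "B", "E"),
  stringsAsFactors = FALSE
)

# Write without row names so the file has exactly two columns: Gene, Class
csv.path <- tempfile(fileext = ".csv")
write.csv(genes, csv.path, row.names = FALSE)

# Read back as character so Gene can be compared directly with tip labels
g <- read.csv(csv.path, colClasses = "character")

# Sanity check: the CSV rows line up with the tip labels
stopifnot(identical(g$Gene, tip.labels))
```

Reading `Gene` as character matters because tip labels are stored as strings, even when they look like numbers.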

I want to color the tip labels on my tree according to the classes defined in my external CSV table. I know it would look something like geom_tiplab(aes(color=something something something)), but I don't know how to make it read the data in my CSV rather than the data within the multitree. Here's what my ggtree command looks like:

myTree <- ggtree(multitree[[i]], aes(x, y)) + 
    ggtitle(names(multitree)[i]) + 
    geom_tiplab() +   ### What I want to annotate with color
    theme_tree2() + 
    coord_fixed(ratio = 0.5) 
print(myTree)         ###Occurs within a for loop, forces ggplot output to display
How do the values in the CSV file sample (Gene, Class) correspond to the values in multitree[[1]]$tip.label? There are no corresponding values to match between them. - eipi10
The gene names in the CSV file are printed directly from the tip.label names, so they have the same names in the same order. Is there any other way I can externally add info on which class each gene belongs to? - Ed Doe
In your example multitree[[1]]$tip.label has values 1 through 50. There are no such values in your CSV example, so how does one figure out which tip.label corresponds to which row in the CSV file? Also, it would be helpful if you created a much smaller example, say, 5 or 6 tip labels, then created a data frame (analogous to your CSV file) that matches tip.label with Class (which is what it seems like you're trying to do). - eipi10
Oh sorry, I used Gene1 and Gene2 etc. as generic examples for the CSV file. In the actual thing, if you created it using the random distance matrix, they would just be called 1, 2, 3, etc. I'll edit it. - Ed Doe

1 Answer


Create a color vector for the class names from your table.

g <- read.csv("Genes.csv", stringsAsFactors = TRUE)  # Class must be a factor for nlevels() below
cols <- rainbow(nlevels(g$Class))

# Function to identify class color for a certain gene 
findCol <- function(x){
    col <- switch(as.character(x), A=cols[1], B=cols[2], C=cols[3], D=cols[4], E=cols[5])
    return(col)
}
col.vect <- sapply(g$Class, findCol)
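The switch() call above hard-codes the five class letters. As a sketch of an alternative, a named vector can do the same lookup for whatever classes appear in the table. The data frame here is a hypothetical stand-in for the contents of Genes.csv, and `class.cols` is a name introduced only for this example:

```r
# Hypothetical stand-in for read.csv("Genes.csv")
g <- data.frame(
  Gene  = as.character(1:6),
  Class = c("A", "A", "D", "C", "B", "E"),
  stringsAsFactors = FALSE
)

# One color per distinct class, named by class letter
class.levels <- sort(unique(g$Class))
class.cols <- setNames(rainbow(length(class.levels)), class.levels)

# Look up each gene's color by its class name (vectorized, no switch needed)
col.vect <- unname(class.cols[g$Class])
```

Indexing by name means the lookup keeps working if a class is added or removed, as long as every class in the table has an entry in `class.cols`.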

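Use this vector as the tip label color in your geom_tiplab() call. The colors are matched to tips by position only, so the rows of the CSV must be in the same order as the tip labels.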
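Another option, sketched here under assumptions, is to attach the class table to the tree with ggtree's `%<+%` operator and map color inside aes(). This matches rows to tips by label rather than by position. The tree and the `classes` data frame below are made-up stand-ins for one element of multitree and for Genes.csv; the first column of the data frame is assumed to hold the tip labels as character strings:

```r
library(ape)
library(ggtree)

# Small random tree standing in for multitree[[i]]
set.seed(1)
mat <- as.dist(matrix(rexp(36, rate = 10), nrow = 6, ncol = 6))
tree <- nj(mat)

# Hypothetical class table; first column must match the tip labels
classes <- data.frame(
  Gene  = tree$tip.label,
  Class = c("A", "A", "D", "C", "B", "E"),
  stringsAsFactors = FALSE
)

# %<+% joins the table onto the tree data, so Class is available in aes()
p <- ggtree(tree) %<+% classes +
  geom_tiplab(aes(color = Class)) +
  theme_tree2()
print(p)
```

Because the join is by label, the same `classes` table can be reused for each tree in the multiPhylo object even if the tips are drawn in a different order.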