Phylogenetic tree

Question

I am working to have a phylogenetic tree based on pairwise-data of genes.Below is my subset of the data(test.txt).The tree does not has to be constructed on the basis of any DNA sequences,but just treating it as words.

ID  gene1   gene2

1   ADRA1D  ADK
2   ADRA1B  ADK
3   ADRA1A  ADK
4   ADRB1   ASIC1
5   ADRB1   ADK
6   ADRB2   ASIC1
7   ADRB2   ADK
8   AGTR1   ACHE
9   AGTR1   ADK
10  ALOX5   ADRB1
11  ALOX5   ADRB2
12  ALPPL2  ADRB1
13  ALPPL2  ADRB2
14  AMY2A   AGTR1
15  AR  ADORA1
16  AR  ADRA1D
17  AR  ADRA1B
18  AR  ADRA1A
19  AR  ADRA2A
20  AR  ADRA2B

Below is my code in R

library(ape)
tab=read.csv("test.txt",sep="\t",header=TRUE)
d=dist(tab,method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))

My figure is attached here enter image description here

I have a question on how they are clustered.Since the pairs

 17 AR  ADRA1B
 18 AR  ADRA1A

and

 2  ADRA1B  ADK
 3  ADRA1A  ADK

should be clustered closely because they have one common gene.so 17 and 2 should be together,and 18 and 3.

Should I use any other method,if I am wrong in using this method(Euclidean distance)?

Should I convert my data to a matrix of rows and columns ,where gene1 is x-axis ,and gene2 is y-axis,each cell being filled by 1 or 0?(Basically if they are paired would mean 1, and if not then 0)

Updated Code :

   table=table(tab$gene1, tab$gene2)
   d <- dist(table,method="euclidean")
   fit <- hclust(d, method="ward")
   plot(as.phylo(fit))

However, in this I get only the genes from gene1 and not gene2 column.The below figure is exactly what I want but should have genes from gene2 column as well

enter image description here

I'm a bit puzzled. How can an euclidean distance be computed on strings? Are the gene1 and gene2 columns in your example really strings or factors? In case they are factor and distcalculates the euclidean distance on the factor levels, nothing reasonable can be expected, I think. — Georg Schnabel
I'm not sure, but it looks like you're trying to do clustering based on the union of the genes present in a given taxon, whereas I think that hclust will do clustering based on the identity of each gene -- i.e. if taxon 1 has gene1=A, gene2=B and taxon 2 has gene2=B, gene2=A, they won't match at all ... — Ben Bolker
@Georg Those are strings ,and I am looking for methods to get some relation out of this table to have some clustering and build a tree.i agree Euclidean can not be used,I just wanted to give an example,as to what I want. — Rgeek

Georg Schnabel Georg Schnabel · Accepted Answer · 2014-02-17T22:51:40

There is some room for interpretation in the example of the question. My answer is only valid if there are really only two genes present in each individual and each row describes an individual. If, however, each row means that gene1 occurs with gene2 with certainty no useful clustering can be performed, in my opinion. In that case I would expect an additional column stating the probability for their common occurrence and something like an principal component analysis (PCA) may be preferred, but I am far away from being an expert on (hierarchial) clustering.

Before you can use the dist function, you have to bring your data into an appropriate format:

# convert test data into suitable format
gene.names <- sort(unique(c(tab[,"gene1"],tab[,"gene2"])))
gene.matrix <- cbind(tab[,"ID"],matrix(0L,nrow=nrow(tab),ncol=length(gene.names)))
colnames(gene.matrix) <- c("ID",gene.names)
lapply(seq_len(nrow(tab)),function(x) gene.matrix[x,match(tab[x,c("gene1","gene2")],colnames(gene.matrix))]<<-1)

The obtained gene.matrix has the shape:

     ID ACHE ADK ADORA1 ADRA1A ADRA1B ADRA1D ADRA2A ...
[1,]  1    0   1      0      0      0      1      0
[2,]  2    0   1      0      0      1      0      0
[3,]  3    0   1      0      1      0      0      0
[4,]  4    0   0      0      0      0      0      0
...

So each row represents an observation (=individual) where the first column identifies the individual and each of the subsequent columns contains 1 if the gene is present and 0 if it is missing. On this matrix the dist function can be reasonably applied (ID column removed):

d <- dist(gene.matrix[,-1],method="euclidean")
fit <- hclust(d, method="ward")
plot(as.phylo(fit))

Maybe, it is a good idea to read up the differences between the distance measures euclidean, manhattan etc. For instance, the euclidian distance between the individuals with ID=1 and ID=2 is:

euclidean_dist = sqrt((0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-1)^2 + ...)

whereas the manhattan distance is

manhattan_dist = abs(0-0) + abs(1-1) + abs(0-0) + abs(0-0) + abs(0-1) + ...

Phylogenetic tree

1 Answers