I am trying to build a phylogenetic-style tree from pairwise gene data. Below is a subset of my data (test.txt). The tree does not have to be constructed from DNA sequences; the gene names can simply be treated as words.
ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
8 AGTR1 ACHE
9 AGTR1 ADK
10 ALOX5 ADRB1
11 ALOX5 ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A AGTR1
15 AR ADORA1
16 AR ADRA1D
17 AR ADRA1B
18 AR ADRA1A
19 AR ADRA2A
20 AR ADRA2B
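Below is my code in R:
library(ape)
# Read the tab-separated table of gene pairs
tab <- read.csv("test.txt", sep = "\t", header = TRUE)
d <- dist(tab, method = "euclidean")
# "ward" is called "ward.D" in current versions of R; the behaviour is the same
fit <- hclust(d, method = "ward.D")
plot(as.phylo(fit))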
My figure is attached here
I have a question about how the rows are clustered. Since the pairs
17 AR ADRA1B
18 AR ADRA1A
and
2 ADRA1B ADK
3 ADRA1A ADK
have one gene in common, I would expect them to cluster closely: 17 and 2 should be together, and so should 18 and 3.
If Euclidean distance is the wrong choice here, which method should I use instead?
Should I convert my data to a matrix in which gene1 gives the rows and gene2 gives the columns, with each cell filled with 1 or 0? (That is, 1 if the two genes are paired and 0 if they are not.)
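Updated code:
# Cross-tabulate the pairs: rows are gene1 values, columns are gene2 values
pair_table <- table(tab$gene1, tab$gene2)
d <- dist(pair_table, method = "euclidean")
# "ward" is called "ward.D" in current versions of R; the behaviour is the same
fit <- hclust(d, method = "ward.D")
plot(as.phylo(fit))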
However, with this I get only the genes from the gene1 column and none from the gene2 column. The figure below is exactly what I want, except that it should include the genes from the gene2 column as well.
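One possible way to get the genes from both columns into the tree is to build a square, symmetric 0/1 matrix over the union of all gene names, so that every gene is a row regardless of which column it came from. The sketch below is only an illustration under some assumptions that are not in the original post: it inlines the 20 rows shown above instead of reading test.txt, it uses the "binary" (Jaccard-type) distance of dist rather than Euclidean distance, and it uses average linkage. The names genes and adj are made up for this example.

```r
# Inline copy of the 20 example rows from the question (normally read from test.txt)
tab <- read.table(text = "ID gene1 gene2
1 ADRA1D ADK
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
5 ADRB1 ADK
6 ADRB2 ASIC1
7 ADRB2 ADK
8 AGTR1 ACHE
9 AGTR1 ADK
10 ALOX5 ADRB1
11 ALOX5 ADRB2
12 ALPPL2 ADRB1
13 ALPPL2 ADRB2
14 AMY2A AGTR1
15 AR ADORA1
16 AR ADRA1D
17 AR ADRA1B
18 AR ADRA1A
19 AR ADRA2A
20 AR ADRA2B", header = TRUE, stringsAsFactors = FALSE)

# Union of the gene names from both columns
genes <- sort(unique(c(tab$gene1, tab$gene2)))

# Square 0/1 matrix: adj[a, b] = 1 if genes a and b occur together in a pair
adj <- matrix(0, length(genes), length(genes), dimnames = list(genes, genes))
adj[cbind(tab$gene1, tab$gene2)] <- 1
adj[cbind(tab$gene2, tab$gene1)] <- 1   # mirror, so column order does not matter

# Binary distance: two genes are close when they share many partners
d <- dist(adj, method = "binary")
fit <- hclust(d, method = "average")
plot(fit)   # with ape loaded, plot(as.phylo(fit)) can be used as in the question
```

With this layout, genes such as ADRA1A and ADRA1B, which are both paired with ADK and AR, have identical rows and therefore a distance of zero. Whether the binary distance is the right notion of similarity depends on what the tree is meant to show.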
dist calculates the euclidean distance on the factor levels, nothing reasonable can be expected, I think. – Georg Schnabel
hclust will do clustering based on the identity of each gene -- i.e. if taxon 1 has gene1=A, gene2=B and taxon 2 has gene1=B, gene2=A, they won't match at all ... – Ben Bolker
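To make the column-order problem from the comments concrete, here is a hedged sketch of one way to cluster the rows (the pairs themselves) so that a shared gene counts no matter which column it sits in. It assumes a pair-by-gene incidence matrix and the "binary" distance of dist, neither of which comes from the original post; the name inc is hypothetical, and only five of the rows from the question are inlined to keep the example short.

```r
# Five rows taken from the question's data (IDs 2, 3, 4, 17, 18)
tab <- read.table(text = "ID gene1 gene2
2 ADRA1B ADK
3 ADRA1A ADK
4 ADRB1 ASIC1
17 AR ADRA1B
18 AR ADRA1A", header = TRUE, stringsAsFactors = FALSE)

genes <- sort(unique(c(tab$gene1, tab$gene2)))
ids <- as.character(tab$ID)

# Incidence matrix: one row per pair, one column per gene,
# 1 if the pair contains that gene (in either column)
inc <- matrix(0, nrow(tab), length(genes), dimnames = list(ids, genes))
inc[cbind(ids, tab$gene1)] <- 1
inc[cbind(ids, tab$gene2)] <- 1

# Binary distance = 1 - (shared genes) / (genes present in either pair)
d <- dist(inc, method = "binary")
dm <- as.matrix(d)

dm["17", "2"]   # pairs 17 and 2 share ADRA1B: 1 - 1/3
dm["17", "4"]   # pairs 17 and 4 share nothing: 1

fit <- hclust(d, method = "average")
plot(fit)
```

Because each pair contains exactly two genes, any two pairs that share one gene are at distance 2/3 and pairs with no gene in common are at distance 1. Pair 17 is therefore as close to 18 (shared AR) as it is to 2 (shared ADRA1B), so this distance alone does not favour grouping 17 with 2 over 17 with 18.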