I have established my gene clusters and already calculated the distances needed to measure their phylogenetic relationship. I used an algorithm that gives a measure of distance between gene clusters; the result is stored in a data frame such as this (input example):

BGC1      BGC2     Distance
------------------------------ 
BGC31     BGC34     0.6
BGC34     BGC45     0.7
BGC34     BGC53     0.2
BGC53     BGC31     0.8

x <- data.frame(BGC1 = c('BGC31','BGC34','BGC34','BGC35'), 
                BGC2 = c('BGC34','BGC45','BGC53','BGC51'), 
                distance = c(0.6,0.7,0.2,0.8))

Goal: Would it be possible to construct a tree based only on this type of data? I would also like to have a .newick file for it, but I'm not sure whether this is possible in R.

I have been able to create network visualizations from this data in Cytoscape, but not a tree. Any further suggestions for this particular example?

Thanks once again for your input :)

My R is weak, but I did this a while ago using Python's Biopython module Bio.Phylo.TreeConstruction with DistanceTreeConstructor and DistanceMatrix. Wrangle your distances into the correct format for DistanceMatrix, build a tree from it with upgma or nj, and draw the tree. - Pallie
I can also try it in Python; I just had a preference for R in this case. However, what do you mean by wrangling the distances into the correct format? Sorry for my ignorance on this. - Biohacker
From biopython.org/DIST/docs/api/… : the distance matrix constructor takes names and matrix as arguments. The names are just a flat list of your gene names. The matrix is a lower triangular distance matrix of all genes vs. all genes. - Pallie
@Pallie Is it possible to use the table from my example above as the input for this? Currently my table of interest consists of these three columns. - Biohacker

1 Answer

Following the suggestion in a comment by user20650 here, you can wrap the distances into a dist object by filling the lower triangle of a matrix, which the lower.tri function selects. However, the provided example will not work as it stands, because it does not provide a distance for every pair of samples. The solution below therefore takes your sample names, generates random distances as stand-ins for the missing values, and then constructs the tree with the nj function from the ape package.
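
To see why the example table is not enough on its own, here is a minimal base-R sketch that reshapes the three-column table into a symmetric matrix and counts the pairs that have no distance. The helper name long_to_matrix is hypothetical (it is not part of ape or base R), and the sketch assumes the first two columns hold the cluster names and the third holds the distance.

```r
# Example input from the question (three-column "long" table).
x <- data.frame(BGC1 = c("BGC31", "BGC34", "BGC34", "BGC35"),
                BGC2 = c("BGC34", "BGC45", "BGC53", "BGC51"),
                distance = c(0.6, 0.7, 0.2, 0.8))

# Hypothetical helper: reshape a long table of pairwise distances into a
# symmetric matrix. Pairs that were never measured stay NA.
long_to_matrix <- function(df) {
  from <- as.character(df[[1]])
  to   <- as.character(df[[2]])
  ids  <- sort(unique(c(from, to)))
  m <- matrix(NA_real_, nrow = length(ids), ncol = length(ids),
              dimnames = list(ids, ids))
  diag(m) <- 0                      # distance of a cluster to itself
  m[cbind(from, to)] <- df[[3]]     # index by (row name, column name)
  m[cbind(to, from)] <- df[[3]]     # mirror to keep the matrix symmetric
  m
}

m <- long_to_matrix(x)

# Count the pairs in the lower triangle that have no distance.
n_pairs   <- choose(nrow(m), 2)
n_missing <- sum(is.na(m[lower.tri(m)]))
cat(n_missing, "of", n_pairs, "pairwise distances are missing\n")
# 11 of 15 pairwise distances are missing
```

Once every pair has a value, as.dist(m) gives a dist object that can be passed to nj in place of the randomly filled matrix.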

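# get all sample names; as.character() works whether the columns are
# factors or plain strings (the default since R 4.0)
x.names = sort(unique(c(as.character(x[, 1]), as.character(x[, 2]))))
n = length(x.names)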

# create all combinations for samples for pairwise comparisons
x2 = data.frame(t(combn(x.names, m = 2)))
# generate random placeholder distances (substitute your real, complete set)
set.seed(4653)
x2$distance = sample(seq(from = 0.1, to = 1, by = 0.05), size = nrow(x2), replace = TRUE)

# prepare a matrix for pairwise distances
dst = matrix(NA, ncol = n, nrow = n, dimnames = list(x.names, x.names))
# fill the lower triangle with the distances obtained elsewhere
dst[lower.tri(dst)] = x2$distance

# construct a phylogenetic tree with the neighbour-joining method
library(ape)
tr = nj(dst)
plot(tr)

The tree can be saved in Newick format with the ape::write.tree function (using its file argument) or printed to the console:

cat(write.tree(tr))
# (BGC53:0.196875,BGC45:0.153125,(((BGC35:0.025,BGC51:0.275):0.1583333333,BGC31:0.2416666667):0.240625,BGC34:0.246875):0.003125);