2
votes

Problem:

I'm trying to import a Newick-format phylogenetic tree. I've done this before with a tree made in the same way, so the code works; the tree itself appears to be the problem. I'm getting a duplicate tip labels error. If that is the case, is there an easy way to remove duplicate tips in R?

Current code:

library(ape)
library(geiger)
library(caper)


taxatree <- read.tree("test2.tre")
sumdata <- read.csv("ogtprop.csv")
sumdataPGLS <- data.frame(A = sumdata$A, OGT = sumdata$OGT, Species = sumdata$Species)

# Replace the space between genus and species with an underscore,
# because the tree's tip labels are formatted as Genus_species
sumdataPGLS$Species <- gsub(" ", "_", sumdata$Species)

comp.dat <- comparative.data(taxatree, sumdataPGLS, "Species")

I get the following error after the last line:

Error in comparative.data(taxatree, sumdataPGLS, "Species") : 
  Duplicate tip labels present in phylogeny 

This suggests the problem is purely with the phylogeny, not the dataframe.

Desired outcome:

A way to remove duplicate tip labels in R

Input data:

Unfortunately the tree is too large to include in full, but here is a subset of it (note that this subset will not parse by itself). I am presenting it in case there are any systematic errors that are obvious to others:

(((('Acidilobus_saccharovorans':4,'Caldisphaera_lagunensis':4)Acidilobales:4,
('Sulfurisphaera_tokodaii':4,('Metallosphaera_hakonensis':4,
'Metallosphaera_sedula':4)Metallosphaera:4,('Acidianus_sulfidivorans':4,
'Acidianus_brierleyi':4)Acidianus:4,('Sulfolobus_metallicus':4,
'Sulfolobus_solfataricus':4,'Sulfolobus_acidocaldarius':4)Sulfolobus:4)
Sulfolobaceae:4,(('Pyrolobus_fumarii':4,'Hyperthermus_butylicus':4,
'Pyrodictium_occultum':4)Pyrodictiaceae:4,('Aeropyrum_camini':4,
('Ignicoccus_hospitalis':4,'Ignicoccus_islandicus':4)Ignicoccus:4,   

3 Answers

1
votes

One possible solution: the issue appears to be the format of the tree being read into the 'phylo' class. In this case internal nodes have names, and some of these names are the same as genera.

One way to 'clean' the tree is to rewrite it in a different Newick format. I found that the Python package ete3 (http://etetoolkit.org/) works for this:

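from ete3 import Tree
import sys

# format=1 reads a Newick file that carries internal node names
t = Tree(sys.argv[1], format=1)

# format=5 writes leaf names and branch lengths only, dropping internal node names
t.write(format=5, outfile="test4.tre")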

The useful call is t.write(format=5, ...). format=5 writes the tree in a form accepted by the comparative.data function used in R, in this case without internal node names.
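If installing ete3 is not an option, the same clean-up can be approximated with the Python standard library alone. The following is only a minimal sketch and not part of ete3: the function name strip_internal_labels is hypothetical, and it assumes that no tip label contains a parenthesis and that the file has no Newick comments in square brackets.

```python
import re

# An internal node label is a label that directly follows ')'.
# Both quoted labels ('...') and bare labels are handled; whitespace or a
# line break between ')' and the label is tolerated.
_INTERNAL_LABEL = re.compile(r"\)\s*(?:'[^']*'|[^():,;'\[\]\s]+)")


def strip_internal_labels(newick):
    """Return the Newick string with every internal node label removed."""
    return _INTERNAL_LABEL.sub(")", newick)


# Toy tree (made up for illustration) with named internal nodes
example = "(('A_x':4,'A_y':4)A:4,('B_x':4,'B_y':4)B:4)Root;"
print(strip_internal_labels(example))
```

Tip labels and branch lengths are left untouched, because they follow '(', ',' or ':' rather than ')'.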

1
votes

I ran into the same problem with my comparative data. I had:

maxillariinae <- comparative.data(tree_gs, data.000, spp_code, vcv=TRUE, vcv.dim=3)
>Error in comparative.data(tree_gs, data.000, spp_code, vcv = TRUE, vcv.dim = 3) : 
>Labels duplicated between tips and nodes in phylogeny

I've solved it in a very simple way:

# Removing node labels:
tree_gs$node.label<-NULL

Then, when I tried to build the comparative data again, it just worked. The pgls I ran next worked as well. I hope it works for you.
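Before dropping the node labels, it can help to see which labels actually clash. The following is a rough diagnostic sketch in Python that works on the raw Newick text; it is not part of caper or ape. The function name duplicated_labels is hypothetical, and the sketch assumes labels contain no parentheses, commas, or colons.

```python
import re
from collections import Counter

# A label is either quoted ('...') or a run of non-punctuation characters.
# Labels after '(' or ',' are tips; labels after ')' are internal nodes.
# Branch lengths are skipped because they follow ':'.
_LABEL = re.compile(r"[(,)]\s*('[^']*'|[^():,;'\[\]\s]+)")


def duplicated_labels(newick):
    """Return the sorted labels that occur more than once among tips and nodes."""
    labels = [m.group(1).strip("'") for m in _LABEL.finditer(newick)]
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Toy tree (made up for illustration): 'Pan' is both a tip and a node label
example = "((Homo_sapiens:1,Homo_erectus:1)Homo:1,(Pan:1,Pan_paniscus:1)Pan:1);"
print(duplicated_labels(example))
```

An empty list means every tip and node label in the file is unique.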

-1
votes

I ran into the same problem because my Newick tree included bootstrap support values in addition to distances. comparative.data worked fine after removing the support values. (The bootstrap values ranged from about 0.89 to just under 1.) Here are the original and revised trees:

Original

((Alligator:0.09129139,(Turtle:0.12361699,(Lizard:0.18330984,
((TasmDevil:0.02519765,Opossum:0.01841396)0.998733:0.03121792,
(Armadillo:0.05330751,((Cow:0.12244558,Dog:0.07483858)0.983085:0.02485452,
(Mouse:0.14438626,GuineaPig:0.03974587)0.972224:0.02107559)0.889194:0.01974521)
0.99985:0.03529365)0.99985:0.18024398)0.988266:0.074151)0.974215:0.11888747)
:1.0964437,Frog:1.0964437):0.0;

Revised

((Alligator:0.09129139,(Turtle:0.12361699,(Lizard:0.18330984,
((TasmDevil:0.02519765,Opossum:0.01841396):0.03121792,
(Armadillo:0.05330751,((Cow:0.12244558,Dog:0.07483858):0.02485452,
(Mouse:0.14438626,GuineaPig:0.03974587):0.02107559):0.01974521):0.03529365)
:0.18024398):0.074151):0.11888747):1.0964437,Frog:1.0964437):0.0;
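Note that the value 0.99985 appears twice in the original tree, so the support values, which are read as node labels, are not unique. Removing them by hand is error-prone on a large tree, so here is a minimal Python sketch of the same edit. It is not the method used above: the function name strip_support_values is hypothetical, and it assumes a support value is a purely numeric label sitting directly after a closing parenthesis.

```python
import re

# A support value is taken to be a numeric label directly after ')' and
# directly before ':', ',', ')' or ';'. Named clades are left alone.
_SUPPORT = re.compile(
    r"\)\s*[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?(?=\s*[:,);])"
)


def strip_support_values(newick):
    """Return the Newick string with numeric internal node labels removed."""
    return _SUPPORT.sub(")", newick)


# Shortened stand-in for the tree above (branch lengths rounded for brevity)
tree = "((Cow:0.12,Dog:0.07)0.983085:0.02,(Mouse:0.14,GuineaPig:0.04)0.972224:0.02)0.889194:0.02;"
print(strip_support_values(tree))
```

Unlike removing all node labels, this keeps any named internal nodes, which matters only if those names are themselves unique.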