2
votes

I do have two sets of data: Abundance data and environmental data and need to "link" them or "overlay" them in a PCA:

I want to conduct a PCA in R which gives me the individuals Ciliate species as Individuals and the environmental parameters as variables. What I do have are two different dataframes. "abundance" where the abundance of the different species at the sampled sites are given and "environment" where the environmental parameters of the sampled sites are given. So I have one parameter in common in eacht dataframe: the sites! If i conduct PCAs I either get a plot with the sites as individuals and the environmental parameters as variables or a plot with the species as individuals and the sites as variables. What I need, would be to link the datasets by the parameter site so that I can conduct a PCA with the ciliate species as individuals ans the environmental parameters as variables. So i would somehow need to link the two PCAs/dataframes over the common parameter site. What I did so far- i did two different PCAs and rememberd the coordinates of the individuals (ciliate species) of PCA1 and the coordinates of the variables of PCA2 (environmental parameters) and biploted them-> The graph is exactly what I would need, but is it still interpretable as a PCA, so the dataframes really linked by the site parameter? Or did simply trick with the data and loose the interpretability?

Another option I tried is to calculate the environmental paramteres of each ciliate species by weighted mean (weighted by the abundance of the ciliates at the site) and conduct a PCA on a dataframe with ciliate species and the weighted mean of the environmental parameters...Which works but I think I loose a lot of information on this way...What do you think?

#Create random dataframe of abundance data, I am sure this can be done simpler and more elegant than this ;)
    species<-c("spec1", "spec2", "spec3", "spec 4", "spec 5", "spec 6", "spec7")
    site1<-c(2,4,19,34,3,6,9)
    site2<-c(5,8,9,12,0,1,1)
    site3<-c(23,56,7,1,1,1,2)
    site4<-c(4,6,2,8,5,1,7)
    abundance<-data.frame(species,site1,site2,site3,site4)
    rownames(abundance)<-abundance$species
    abundance<-abundance[,-1]
    #Create random dataframe of abundance data
    #environmental parameters of the sites
    X<-c("site1","site2","site3","site4")
    Temp<-c(24,24.5,23.5,25)
    Chla<-c(2.2,1.5,2.0,3.4)
    Plo<-c(1000,2000,1500,200)
    Plo2<-c(200,400,600,200)
    environment<-data.frame(X,Temp,Chla,Plo,Plo2)
    rownames(environment)<-environment$X
    environment<-environment[,-1]
    ###PCA on abundance data
    #hellinger pre-transformation of abundance data
    library(vegan)
    abu.h<-decostand(abundance,"hellinger")
    abu.h.pca<-prcomp(abu.h)
    envir.pca<-prcomp(environment,scale=TRUE)
    biplot(abu.h.pca)
    ##and now I would need to discard the sites vectors and overlay it with 
    #the environmental sites factors, due to my prof?
    #Graph of individuals 
    fviz_pca_ind(abu.h.pca) 
    ##get coordinates 
    library(factoextra)
    ind<-get_pca_ind(abu.h.pca) 
    head(ind$coord) 
    #x in biplot 
    ind<-ind$coord 
    ind<-ind[,1:2]
    ind 
    #y variables 
    # Extract the results for variables only

    vari<-get_pca_var(abu.h.pca) 
    var<-vari$coord 
    var<-var[,1:2] 
    var 
    biplot(ind, var, var.axes = TRUE) 
1
Seems more appropriate for CrossValidated.com where questions of statistical validity are more on-topic than is the case on SO.IRTFM
Thanks a lot, that is a good advise!Ta Ani

1 Answers

2
votes

I've never done what you're describing, but I know you can do a relate to the environmental (abiotic) data with vector overlays on top of an nMDS. If you can do that with a PCA, I'm not sure, but at least my PRIMER manual mentions that a PCA of abiotic data using Euclidean distance is a good fit with an nMDS of the biotic data, and is how PRIMER's BEST function works. But this isn't PRIMER.

See the vegan::envfit function. The intro vignette covers it briefly. Vegan tutor covers it a bit more.

I transposed the species data and did it using an nMDS of the species data.

library(vegan)

species <-c ("spec1", "spec2", "spec3", "spec 4", "spec 5", "spec 6", "spec7")
site1 <- c(2,4,19,34,3,6,9)
site2 <- c(5,8,9,12,0,1,1)
site3 <- c(23,56,7,1,1,1,2)
site4 <- c(4,6,2,8,5,1,7)
abundance <- data.frame(species,site1,site2,site3,site4)
rownames(abundance) <- abundance$species
abundance <- abundance[,-1]
abundance <- t(abundance)

X <- c ("site1","site2","site3","site4")
Temp <- c(24,24.5,23.5,25)
Chla <- c(2.2,1.5,2.0,3.4)
Plo <- c(1000,2000,1500,200)
Plo2 <- c(200,400,600,200)
environment <- data.frame(X,Temp,Chla,Plo,Plo2)
rownames(environment) <- environment$X
environment <- environment[,-1]

AbEnvMDS <- metaMDS(abundance, k = 2)
AbEnvFit <- envfit(AbEnvMDS, environment)

plt <- plot(AbEnvMDS) # displays both sites (empty circles) and species (red +)
plt <- plot(AbEnvMDS, display = "species") # displays only species (red +)
plt
identify(plt, what = "species") # choose your points
plot(AbEnvFit) # overlays your environment

Graph of results