
Link to data as txt file here

I'm having trouble with this PCA. PC1 results appear binary, and I can't figure out why as none of my variables are binary.

df = bees

library(dplyr)
library(ggplot2)

pca_dat_condition <- bees %>% ungroup() %>%
  select(Length.1:Length.25, OBJECTID, Local, Elevation, Longitude,
         Latitude, Cubital.Index) %>%
  na.omit()

# drop the identifying and geographic columns, keeping only the measurements
pca_dat_first <- pca_dat_condition %>%
  select(-Local, -OBJECTID, -Elevation, -Longitude, -Latitude)

pca <- pca_dat_first %>%
  scale() %>%
  prcomp()

# add identifying information back into PCA data
pca_data <- data.frame(pca$x,
                       Local = pca_dat_condition$Local,
                       ID = pca_dat_condition$OBJECTID,
                       elevation = pca_dat_condition$Elevation,
                       Longitude = pca_dat_condition$Longitude,
                       Latitude = pca_dat_condition$Latitude)

ggplot(pca_data, aes(x = PC1, y = PC2, color = Latitude)) +
  geom_point() +
  ggtitle("PC1 vs PC2: All Individuals") +
  scale_colour_gradient(low = "blue", high = "red")

I'm not getting any error messages with the code, and when I look at the data frame nothing looks out of place. Should I be using a different function for the PCA? Any insight into why my graph may look like this?

Previously, I did the same PCA but for the average values for each Local (whereas this is each individual), and it came out as a normal PCA with no clear clustering. I don't understand why this problem would arise when looking at individual points. It's possible I merged some other data frames in a wonky way, but the structure of the dataset seems completely normal.

This is how the PCA looks.

Have you done any preliminary exploratory analysis? The distribution of the length data seems odd. Try hist(bees[, 1]). - AkselA
My guess would be some kind of parsing or conversion error. Or maybe a mix-up of units, e.g. centimetres vs. micrometres? - AkselA
Another odd relation. Try HIGH1 = which(pca_data$PC1 > 5); ggplot(pca_data[-HIGH1,], aes(x=PC1, y=PC2, color = Latitude)) + geom_point() + ggtitle("PC1 vs PC2: All Individuals") + scale_colour_gradient(low = "blue", high = "red") - G5W
The outliers among the length variables seem to be limited to three individuals. Removing those from the data set appears to make things look a bit better. - AkselA
Double-checked the units; they're all the same. I'm looking into the length data with hist(), but am not getting anything insightful from it. - wissem
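
The outlier check suggested in the comments above can be sketched on synthetic data. Everything here is an assumption for illustration: the `toy` data frame, the injected values, the `robust_z` helper and the threshold of 10 are hypothetical and not taken from the bees data. The sketch uses a median/MAD z-score rather than `scale()`, because a handful of extreme values inflates the ordinary standard deviation and can hide themselves.

```r
set.seed(1)

# Hypothetical stand-in for the bees measurements: two length columns
toy <- data.frame(
  Length.1 = rnorm(100, mean = 2, sd = 0.1),
  Length.2 = rnorm(100, mean = 3, sd = 0.2)
)
toy$Length.1[c(10, 50, 90)] <- 200   # three injected outliers

# Robust z-score: distance from the median in units of the MAD
robust_z <- function(x) (x - median(x)) / mad(x)

# Matrix of robust z-scores, one column per variable
z <- sapply(toy, robust_z)

# Rows where any variable is extreme (the threshold of 10 is arbitrary)
flagged <- which(apply(abs(z), 1, max) > 10)
flagged
```

On real data, the same idea would be applied to the measurement columns only, and the flagged rows inspected by hand before deciding whether to drop them.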

1 Answer

bees <- read.csv(paste0("https://gist.githubusercontent.com/AkselA/", 
                    "08a4e78a6a29a918ed597e9a32adc228/raw/", 
                    "6d0005fad4cb91830bcf7087176283b18683e9cd/bees.csv"), 
                    header=TRUE)

# bees <- bees[bees[,1] < 10,]  # This will remove the three offending rows
bees <- na.omit(bees)

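# Identifying columns plus measurements (mirrors pca_dat_condition in the question)
bees.cond <- bees[, grep("Length|OBJ|Loc|Ele|Lon|Lat|Cubi", colnames(bees))]

# Measurement columns only: these go into the PCA
bees.first <- bees[, grep("Length|Cubi", colnames(bees))]
summary(bees.first)

# One histogram per scaled variable, with log1p-transformed counts so that
# sparsely populated bins (the outliers) remain visible
par(mfrow=c(7, 4), mar=rep(1, 4))
q <- lapply(1:ncol(bees.first), function(x) {
    h <- hist(scale(bees.first[, x]), plot=FALSE)
    h$counts <- log1p(h$counts)
    plot(h, main="", axes=FALSE, ann=FALSE)
    legend("topright", legend=names(bees.first[x]), 
      bty="n", cex=0.8, adj=c(0, -2), xpd=NA)
    })

bees.pca <- prcomp(bees.first, scale.=TRUE)
biplot(bees.pca)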

Before removing the outliers

[two plots; images not preserved]

After removing the outliers

[two plots; images not preserved]
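
As a rough illustration of why a few extreme rows can make PC1 look binary, here is a minimal sketch on synthetic data. The data frame `toy`, the factor of 100 and all variable names are hypothetical and only mimic the situation described above; they are not the bees data. When three rows are far larger than the rest in every variable, those rows dominate the variance, so PC1 mostly separates them from everything else.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in: five correlated "length" variables
base <- rnorm(n)
toy <- as.data.frame(sapply(1:5, function(i) 2 + 0.1 * base + rnorm(n, sd = 0.1)))
names(toy) <- paste0("Length.", 1:5)

# Simulate three rows recorded on a different scale
toy_out <- toy
toy_out[1:3, ] <- toy_out[1:3, ] * 100

pca_out   <- prcomp(toy_out, scale. = TRUE)            # with the three rows
pca_clean <- prcomp(toy_out[-(1:3), ], scale. = TRUE)  # without them

# How far the three rows sit from everyone else along PC1
pc1 <- pca_out$x[, "PC1"]
gap <- min(abs(pc1[1:3])) / max(abs(pc1[-(1:3)]))

# Share of total variance that PC1 captures in each fit
prop_out   <- summary(pca_out)$importance["Proportion of Variance", "PC1"]
prop_clean <- summary(pca_clean)$importance["Proportion of Variance", "PC1"]
```

Comparing `prop_out` with `prop_clean`, and plotting `pca_out$x` next to `pca_clean$x`, shows how much of the first component is driven by the three rows alone.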