R PCA makes graph that is fishy, can't ID why

Question

I'm having trouble with this PCA. PC1 results appear binary, and I can't figure out why as none of my variables are binary.

df = bees

library(dplyr)    # %>%, ungroup(), select()
library(ggplot2)  # ggplot()

pca_dat_condition <- bees %>%
  ungroup() %>%
  select(Length.1:Length.25, OBJECTID, Local, Elevation, Longitude,
         Latitude, Cubital.Index) %>%
  na.omit()

# remove the identifying (non-measurement) columns before the PCA
pca_dat_first <- pca_dat_condition %>%
  select(-Local, -OBJECTID, -Elevation, -Longitude, -Latitude)

pca <- pca_dat_first %>%
  scale() %>%
  prcomp()

# add identifying information back into PCA data
pca_data <- data.frame(pca$x,
                       Local = pca_dat_condition$Local,
                       ID = pca_dat_condition$OBJECTID,
                       elevation = pca_dat_condition$Elevation,
                       Longitude = pca_dat_condition$Longitude,
                       Latitude = pca_dat_condition$Latitude)

ggplot(pca_data, aes(x = PC1, y = PC2, color = Latitude)) +
  geom_point() +
  ggtitle("PC1 vs PC2: All Individuals") +
  scale_colour_gradient(low = "blue", high = "red")

I'm not getting any error messages with the code, and when I look at the data frame nothing looks out of place. Should I be using a different function for the PCA? Any insight into why my graph may look like this?

Previously, I did the same PCA but for the average values for each Local (whereas this is each individual), and it came out as a normal PCA with no clear clustering. I don't understand why this problem would arise when looking at individual points. It's possible I merged some other data frames in a wonky way, but the structure of the dataset seems completely normal.

This is how the PCA looks.

Have you done any preliminary exploratory analysis? The distribution of the length data seems odd. Try hist(bees[, 1]). — AkselA
My guess would be some kind of parsing or conversion error. Or maybe a mixup of units? centimetre vs micrometre? — AkselA
Another odd relation. Try HIGH1 = which(pca_data$PC1 > 5); ggplot(pca_data[-HIGH1,], aes(x=PC1, y=PC2, color = Latitude)) + geom_point() +ggtitle("PC1 vs PC2: All Individuals") + scale_colour_gradient(low = "blue", high = "red") — G5W
The outliers among the length variables seem to be limited to three individuals. Removing those from the data set appears to make things look a bit better. — AkselA
Double checked the units, they're all the same there. I'm looking into the length and hist(), but am not getting anything insightful from those. — wissem

AkselA · Accepted Answer · 2017-12-01T00:41:08

bees <- read.csv(paste0("https://gist.githubusercontent.com/AkselA/", 
                    "08a4e78a6a29a918ed597e9a32adc228/raw/", 
                    "6d0005fad4cb91830bcf7087176283b18683e9cd/bees.csv"), 
                    header=TRUE)

# bees <- bees[bees[,1] < 10,]  # This will remove the three offending rows
bees <- na.omit(bees)

bees.cond <- bees[, grep("Length|OBJ|Loc|Ele|Lon|Lat|Cubi", colnames(bees))]

bees.first <- bees[, grep("Length|Cubi", colnames(bees))]
summary(bees.first)

# One histogram per variable, on scaled values; counts are log1p-transformed
# so that bins holding only a few outliers stay visible
par(mfrow=c(7, 4), mar=rep(1, 4))
q <- lapply(1:ncol(bees.first), function(x) {
    h <- hist(scale(bees.first[, x]), plot=FALSE)
    h$counts <- log1p(h$counts)
    plot(h, main="", axes=FALSE, ann=FALSE)
    legend("topright", legend=names(bees.first[x]), 
      bty="n", cex=0.8, adj=c(0, -2), xpd=NA)
    })

bees.pca <- prcomp(bees.first, scale.=TRUE)
biplot(bees.pca)


Before removing the outliers

After
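
The effect discussed in the comments (a handful of extreme rows making PC1 look binary) can be reproduced on made-up data. The sketch below does not use the bees data set: the data frame dat, the corrupted row indices in bad, the factor of 100 and the z-score cutoff of 5 are all assumptions chosen for illustration. It flags rows by their largest absolute scaled value, then runs prcomp() with and without them.

```r
# Synthetic stand-in for the measurements (NOT the real bees data):
# 200 individuals, 5 correlated length-like columns.
set.seed(1)
n <- 200
base <- rnorm(n)
dat <- as.data.frame(sapply(1:5, function(i) 2 + 0.1 * base + rnorm(n, sd = 0.05)))
names(dat) <- paste0("Length.", 1:5)

# Corrupt three rows to mimic a unit or parsing mix-up (hypothetical)
bad <- c(10L, 50L, 120L)
dat[bad, ] <- dat[bad, ] * 100

# Largest absolute z-score in each row; the cutoff of 5 is an arbitrary choice
z <- scale(dat)
row_max_z <- apply(abs(z), 1, max)
flagged <- unname(which(row_max_z > 5))

# PCA with and without the flagged rows
pca_all   <- prcomp(dat, scale. = TRUE)
pca_clean <- prcomp(dat[-flagged, ], scale. = TRUE)

# With the outliers kept, PC1 mostly separates the flagged rows from the rest
plot(pca_all$x[, 1:2], col = ifelse(seq_len(n) %in% flagged, "red", "black"))
```

On the real data, the same row-wise check could be run on bees.first in place of dat. Whether the flagged rows should be dropped or corrected depends on how they got into the data.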