I am currently studying principal component analysis and playing around with the R prcomp function. My code is as follows:

library(dplyr)

iris1 = mutate( iris,
                   Species = factor( Species),
                   logSepalLength = log10( Sepal.Length ),
                   logSepalWidth = log10( Sepal.Width ),
                   logPetalLength = log10( Petal.Length ),
                   logPetalWidth = log10( Petal.Width )
                   ) %>%
  dplyr::select(Species, starts_with("log") ) 

# Each log-transformed variable appears once in the formula
iris1.PCA = prcomp( ~ logSepalLength + 
                         logSepalWidth + 
                         logPetalLength + 
                         logPetalWidth, 
                       data = iris1, scale. = FALSE ) 

summary(iris1.PCA)

The output of summary(iris1.PCA) is as follows:

Importance of components:
                          PC1     PC2     PC3     PC4
Standard deviation     0.4979 0.06009 0.05874 0.02337
Proportion of Variance 0.9702 0.01413 0.01350 0.00214
Cumulative Proportion  0.9702 0.98436 0.99786 1.00000

I want to use ggplot to generate a scree plot that shows the cumulative contribution to total variance for each principal component. I can do this calculation manually, starting from the covariance matrix, using something like cumsum(eigenvals)/iris1.cov.trace. However, according to summary(iris1.PCA), prcomp already calculates the cumulative proportion for us. How can I take that part of the summary(iris1.PCA) object and pass it to ggplot to generate the scree plot? I know I could copy the printed values by hand, but I'm looking for an automated solution, since hard-coding values is not good software engineering practice.

I found this example of a scree plot using ggplot (although, it does not use cumulative contribution to total variance):


var_explained_df %>%
  ggplot(aes(x=PC,y=var_explained, group=1))+
  geom_point(size=4)+
  geom_line()+
  labs(title="Scree plot: PCA on scaled data")

1 Answer


Here's an example using the output from the PCA. The sdev element of the summary holds the standard deviation of each principal component. The proportion of variance explained by a component is its squared standard deviation (i.e., its variance) divided by the sum of all of the squared standard deviations.

library(ggplot2)
library(scales)  # provides percent() for the axis labels

s <- summary(iris1.PCA)
# One row per component: its label and its share of the total variance
dat <- data.frame(
  component = factor(seq_along(s$sdev), labels = paste0("PC", seq_along(s$sdev))),
  var_explained = s$sdev^2 / sum(s$sdev^2)
)
ggplot(dat, aes(y = var_explained)) + 
  geom_line(aes(x = component, group = 1)) + 
  geom_point(aes(x = component)) + 
  labs(x = "Component", y = "% Variance Explained") + 
  scale_y_continuous(labels = percent) + 
  theme_bw() + 
  ggtitle("Scree plot: PCA on log-transformed data")

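The question asks specifically for the cumulative proportion, which the plot above does not show. The table printed by summary() is stored as a matrix in the importance element of the summary object, so the "Cumulative Proportion" row can be pulled out by name. Below is a minimal sketch of that idea. The names log_iris, pca, imp and cum_df are just illustrative, and the PCA is rebuilt from the built-in iris data only so that the snippet runs on its own; with the objects from the question you would use summary(iris1.PCA)$importance directly.

```r
library(ggplot2)

# Rebuild the PCA on the log10-transformed measurements (columns 1 to 4 of iris)
# so that this sketch does not depend on objects defined elsewhere.
log_iris <- log10(iris[, 1:4])
pca <- prcomp(log_iris, scale. = FALSE)

# summary() keeps the printed table as a matrix: rows are the three statistics,
# columns are the components PC1, PC2, ...
imp <- summary(pca)$importance

# One row per component, keeping the PC1..PC4 order on the x axis
cum_df <- data.frame(
  component = factor(colnames(imp), levels = colnames(imp)),
  cum_var   = imp["Cumulative Proportion", ]
)

p <- ggplot(cum_df, aes(x = component, y = cum_var, group = 1)) +
  geom_line() +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Component", y = "Cumulative variance explained",
       title = "Cumulative scree plot: PCA on log-transformed data") +
  theme_bw()
print(p)
```

As far as I can tell, the values in importance are rounded for printing, so if you need full precision you may prefer cumsum(pca$sdev^2) / sum(pca$sdev^2), which gives the same curve from the unrounded standard deviations.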