I have performed a PCA using the prcomp function (part of base R's stats package) on quite a substantial dataset of 3000 x 500.

I have tried plotting the principal components that cover up to 100% of the cumulative proportion of variance with an fviz_eig plot (from the factoextra package). However, the plot is very large because the dataset has so many dimensions. Is there any way in R to split a plot into multiple plots, using a for loop or any other way?

Here is my plot, which only covers 80% of the variance because the full plot is too large. Could I split this plot into 2 plots?

Large Dataset Visualisation

I have tried splitting the plot up using a for loop...

for(i in data[1:20]) {
  fviz_eig(data, addlabels = TRUE, ylim = c(0, 30))
}

But this doesn't work.

Edited Reproducible example:

This is only a small reproducible example using an already available dataset in R but I used a similar method for my large dataset. It will show you how the plot actually works.

# Already existing data in R.
install.packages("boot")
library(boot)
library(factoextra)  # provides fviz_eig()
data(frets)
frets

dataset_pca <- prcomp(frets)
dataset_pca$x

fviz_eig(dataset_pca, addlabels = TRUE, ylim = c(0, 100))

PCA Analysis Fviz_eig plot

However, my large dataset has a lot more PCs than this one (possibly 100 or more to cover up to 100% of the cumulative proportion of variance), which is why I would like a way to split the single plot into multiple plots for better visualisation.

Update:

I have performed what was said by @G5W below...

 data <- prcomp(data, scale = TRUE, center = TRUE)

 POEV = data$sdev^2 / sum(data$sdev^2)
 barplot(POEV, ylim=c(0,0.22))

 lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)
 text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)

 barplot(POEV[1:40], ylim=c(0,0.22), main="PCs 1 - 40")
 text(0.7+(0:6)*1.2, POEV[1:40], labels = round(100*POEV[1:40], 1),
 pos=3)

and I have now got a graph as follows...

Graph

But I am finding it difficult getting the labels to appear above each bar. Can someone help or suggest something for this please?

If you use the built-in prcomp function and then something like ggfortify cran.r-project.org/web/packages/ggfortify/vignettes/… the plots will automatically be split into 2x PCAs per plot (1 on the vertical axis, the other on the horizontal) – Greg
It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. – MrFlick
@MrFlick please see edit. Hopefully this helps! – Matt.W
Your example is a step in the right direction, but I am not quite sure what you want. Why isn't the graph that you show good enough? What kind of splitting do you want? Maybe 10 PCAs at a time? – G5W
@G5W yes for this particular reproducible example it is good enough. However, for my larger dataset I need more than 70 PCAs to make up 100%. Therefore, this makes a very crowded graph and so I need to split the graph into multiple graphs i.e. 10 PCAs at a time for example. Any suggestions would be great! – Matt.W

1 Answer

I am not 100% sure what you want as your result, but I am 100% sure that you need to take more control over what is being plotted, i.e. do more of it yourself. So let me show an example of doing that. The frets data that you used has only 4 dimensions so it is hard to illustrate what to do with more dimensions, so I will instead use the nuclear data - also available in the boot package. I am going to start by reproducing the type of graph that you displayed and then altering it.

library(boot)
data(nuclear)
N_PCA = prcomp(nuclear)
plot(N_PCA)

Basic PCA plot

The basic plot of a prcomp object is similar to the fviz_eig plot that you displayed but has three main differences. First, it shows the actual variances, not the percent of variance explained. Second, it does not contain the line that connects the tops of the bars. Third, it does not have the text labels that give the heights of the bars.

Percent of Variance Explained. The return from prcomp contains the raw information. str(N_PCA) shows that it has the standard deviations, not the variances - and we want the proportion of total variation. So we just create that and plot it.

POEV = N_PCA$sdev^2 / sum(N_PCA$sdev^2)
barplot(POEV, ylim=c(0,0.8))

Percent of Variance Explained
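As a sanity check on the computation above, here is a minimal sketch that compares the hand-computed proportions with what summary() reports for the same prcomp object, and then uses the cumulative sum to count how many PCs are needed to reach a given share of the variance. The 80% threshold is only an illustration taken from the question, and the names imp, cum_POEV and n_needed are my own.

```r
# Sketch only: cross-check the hand-computed proportions against summary().
data(nuclear, package = "boot")
N_PCA <- prcomp(nuclear)
POEV <- N_PCA$sdev^2 / sum(N_PCA$sdev^2)

# summary() reports the same quantities, but rounded, so compare with a tolerance
imp <- summary(N_PCA)$importance
max_diff <- max(abs(POEV - imp["Proportion of Variance", ]))

# Cumulative proportion, and the first PC at which it reaches 80%
# (0.80 is an illustrative threshold, not a recommendation)
cum_POEV <- cumsum(POEV)
n_needed <- which(cum_POEV >= 0.80)[1]
```

If max_diff is not tiny, the proportions were probably computed from the standard deviations rather than the variances.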

This addresses the first difference from the fviz_eig plot. Regarding the line, you can easily add that if you feel you need it, but I recommend against it. What does that line tell you that you can't already see from the barplot? If you are concerned about too much clutter obscuring the information, get rid of the line. But just in case you really want it, you can add the line with

lines(0.7+(0:10)*1.2, POEV, type="b", pch=20)

Scree plot with line

However, I will leave it out as I just view it as clutter.

Finally, you can add the text with

text(0.7+(0:10)*1.2, POEV, labels = round(100*POEV, 1), pos=3)

Scree Plot with text labels
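The x positions 0.7+(0:10)*1.2 used above are the bar centres for 11 bars with barplot's default width and spacing, so they have to be rewritten whenever the number of bars changes (an offset such as 0.7+(0:6)*1.2 only covers 7 bars). A less fragile variant, sketched here, relies on the fact that barplot() returns the bar midpoints; mids is just my name for that return value.

```r
# Sketch: let barplot() supply the x positions instead of hard-coding them.
data(nuclear, package = "boot")
N_PCA <- prcomp(nuclear)
POEV <- N_PCA$sdev^2 / sum(N_PCA$sdev^2)

# barplot() invisibly returns the x-coordinates of the bar centres
mids <- barplot(POEV, ylim = c(0, 0.8), names.arg = seq_along(POEV),
                xlab = "Principal component",
                ylab = "Proportion of variance explained")

lines(mids, POEV, type = "b", pch = 20)                   # optional connecting line
text(mids, POEV, labels = round(100 * POEV, 1), pos = 3)  # label above each bar
```

Because mids always has one entry per bar, the same two calls work unchanged for POEV[1:40] or any other subset.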

This is also somewhat redundant, but particularly if you change scales (as I am about to do), it could be helpful for making comparisons.

OK, now that we have the substance of your original graph, it is easy to separate it into several parts. For my data, the first two bars are big so the rest are hard to see. In fact, PCs 5-11 show up as zero. Let's separate out the first 4 and then the rest.

barplot(POEV[1:4], ylim=c(0,0.8), main="PC 1-4")
text(0.7+(0:3)*1.2, POEV[1:4], labels = round(100*POEV[1:4], 1),
     pos=3)

barplot(POEV[5:11], ylim=c(0,0.0001), main="PC 5-11")
text(0.7+(0:6)*1.2, POEV[5:11], labels = round(100*POEV[5:11], 4),
     pos=3, cex=0.8)

Scree plot in two pieces

Now we can see that even though PC 5 is much smaller than any of 1-4, it is a good bit bigger than 6-11.

I don't know what you want to show with your data, but if you can find an appropriate way to group your components, you can zoom in on whichever PCs you want.
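If fixed-size groups (say 10 PCs at a time, as suggested in the comments) are good enough, the splitting can be put in a loop. The following is only a sketch: plot_scree_chunks is a hypothetical helper of my own, not a function from any package, and the simulated matrix X is a stand-in for your real data. Each chunk gets its own y-axis, so later chunks are zoomed in automatically.

```r
# Hypothetical helper (not part of any package): draw one labelled
# bar plot per group of `chunk_size` consecutive PCs.
plot_scree_chunks <- function(poev, chunk_size = 10) {
  idx <- seq_along(poev)
  chunks <- split(idx, ceiling(idx / chunk_size))
  for (ch in chunks) {
    vals <- poev[ch]
    # Per-chunk y-axis with 15% headroom so the labels fit above the bars
    mids <- barplot(vals, names.arg = ch,
                    ylim = c(0, max(vals) * 1.15),
                    main = sprintf("PCs %d - %d", min(ch), max(ch)),
                    xlab = "Principal component",
                    ylab = "Proportion of variance explained")
    text(mids, vals, labels = signif(100 * vals, 2), pos = 3, cex = 0.8)
  }
  invisible(chunks)
}

# Simulated stand-in data (an assumption; substitute your own matrix)
set.seed(1)
X <- matrix(rnorm(200 * 35), nrow = 200)
pca <- prcomp(X, scale. = TRUE)
POEV <- pca$sdev^2 / sum(pca$sdev^2)

chunks <- plot_scree_chunks(POEV, chunk_size = 10)
```

Calling par(mfrow = c(2, 2)) beforehand puts four chunks on one page. Keep in mind that the y-axis differs between chunks, so bar heights are only comparable within a plot; the text labels help when comparing across plots.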