0
votes

I have a Dataframe as follows.....

df <-data.frame(variableA, variableB, variableC, variableD, variableE)

prcomp(scale(df))
summary(prcomp)

gives the following results

                          PC1    PC2    PC3     PC4     PC5
Cumulative Proportion  0.5127 0.7222 0.8938 0.96075 1.00000

Is there a way I add the PCs to a new dataframe, up until the Cumulative Proportion reaches 85%?

I have a number of Dataframes that I wish to run this for, they vary in size but I would like 85% to be an arbitrary cut off point.

3
Does summary(prcomp)[summary(prcomp) < 0.85] return the values you expect? - rg255
Does this answer your question? Interpreting PCA Results - Len Greski

3 Answers

0
votes

Without a bit more detail it's hard to say, but you might run into issues because the length of the resulting vector will differ from analysis to analysis. E.g. one might result in 4 principal components that meet your conditions, another 3 in principal components. Dataframes are in the other hand, rectangular, so each row must be the same length, and each column the same length, so you couldn't make a dataframe that had 3 columns in one row and 4 in another.

A couple of simple options with a vector like yours:

# your vector of pcs
x1 <- summary(prcomp)

1) Make a data frame that fits the maximum number of components, full of NAs, then replace accordingly retaining NAs where appropriate.

# storage df
outDF <- data.frame(matrix(rep(NA, 8), ncol = 4))
# store
outDF[1, x1 < 0.85] <- x1[x1 < 0.85]

2) Store as a list, because lists do not need to be rectangular

# storage list
outList <- list()
# store
outList[[1]] <- x1[x1 < 0.85]
0
votes

You can extract the threshold from the summary, e.g. like this:

getMinPCs <- function(mat, thresh=.85){
    return(which(summary(prcomp(scale(mat)))$importance["Cumulative Proportion",] >= thresh)[1])
}

(although, obviously, you may wish to run prcomp only once, and do something in addition with the subset within the function)

It is not clear to me what you want in a new data.frame - maybe the rotation matrix, which you could then subset and return - or subset and return the whole list:

getMinPCrotations <- function(mat, thresh=.85){
    res_pca <- prcomp(scale(mat))
    nPCs <- which(summary(res_pca)$importance["Cumulative Proportion",] >= thresh)[1]
    sub <- list(sdev=res_pca$sdev[seq_len(nPCs)], 
                rotation=res_pca$rotation[, seq_len(nPCs)],
                center=res_pca$center[seq_len(nPCs)],
                scale=res_pca$scale[seq_len(nPCs)],
                x=res_pca$x[, seq_len(nPCs)]
    )
    # setattr(sub, "class", "prcomp")
    return(sub)
}
0
votes

Here is an approach to identify the components explaining up to 85% variance, using the spam data from the kernlab package.

library(kernlab)
data(spam)
# log transform independent variables, ensuring all values above 0
princomp <- prcomp(log10(spam[,-58]+1))
stats <- summary(princomp)
# extract variable importance and list items explaining up to 85% variance
importance <- stats$importance[3,]
importance[importance <= 0.85]

...and the output:

> importance[importance <= 0.85]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11 
0.49761 0.58021 0.63101 0.67502 0.70835 0.73188 0.75100 0.76643 0.78044 0.79368 0.80648 
   PC12    PC13    PC14 
0.81886 0.83046 0.84129 
>

We can obtain the factor scores for the first 14 components and save them as a data frame as follows.

resultNames <- names(importance[importance <= 0.85])
# return factor scores 
x_result <- as.data.frame(princomp$x[,resultNames])
head(x_result)

...and the output:

> head(x_result)
         PC1         PC2          PC3          PC4          PC5         PC6         PC7
1  0.7364988  0.19181730  0.041818854 -0.009236399  0.001232911  0.03723833 -0.01144332
2  1.3478167  0.22953561 -0.149444409  0.091569400 -0.148434128 -0.01923707 -0.07119210
3  2.0489632 -0.02668038  0.222492079 -0.107120738 -0.092968198 -0.06400683 -0.07078830
4  0.4912016  0.20921288 -0.002072148  0.015524007 -0.002347262 -0.14519336 -0.09238828
5  0.4911676  0.20916725 -0.002122664  0.015467369 -0.002373622 -0.14517812 -0.09243136
6 -0.2337956 -0.10508875  0.187831101 -0.335491660  0.099445713  0.09516875  0.11234080
          PC8          PC9        PC10        PC11        PC12         PC13        PC14
1 -0.08745771  0.079650230 -0.14450436  0.15945517 -0.06490913 -0.042909658  0.05739735
2  0.00233124 -0.091471125 -0.10304536  0.06973190  0.09373344  0.003069536  0.02892939
3 -0.10888375  0.227437609 -0.07419313  0.08217271 -0.12488575  0.150950134  0.05180459
4 -0.15862241  0.003044418  0.01609690  0.01720151  0.02313224  0.142176889 -0.04013102
5 -0.15848785  0.002944493  0.01606874  0.01725410  0.02304496  0.142527110 -0.04007788
6 -0.13790588  0.197294502  0.07851300 -0.08131269 -0.02091459  0.246810914 -0.01869192
> 

To merge the data with the original data frame, we can use cbind().

mergedData <- cbind(spam,x_result)