0
votes

My question is: in the case of having a matrix we want to do PCA on, where the number of features greatly outnumbers the number of trials, why doesn't prcomp behave as expected (or am I missing something)?

Below is a summary of the issue, full code is here, compressed 7MB data source is here (is 55MB uncompressed), target image is here.

My exact situation is that I have a matrix p by n matrix X (p = features, n = trials) where the trials are photos taken of faces, and the features are the pixels in the photos (so a 32256 by 148 matrix). What I want to do is find the principal component score vectors of that matrix. Since finding the covariance matrix XX^T is too expensive, an easy solution is to find the eigenvectors (v_i) of X^TX and transform them by X (Xv_i) more info.

XTX <- t(X) %*% X # missing the 1 / n - 1 for cov matrix b/c we normalize later anyway
eigen <- eigen(XTX)
eigenvectors.XTX.col <- eigen$vectors

principal.component.scores <- apply(eigenvectors.XTX.col, 2, function(c) {
  normalize.vector(X %*% matrix(c, ncol = 1))
})

The principal component scores are eigenfaces in my case, and can be used to successfully reconstruct the target face as seen here: http://cl.ly/image/260w0N0u0Z3y (refer to my full code for how)

Passing X to prcomp should do something equivalent, but has a different result than the above homegrown way:

pca <- prcomp(X)
pca$x # right size, but wrong pc scores

The result of using pca$x in reconstructing the face is not total crap, but much worse: http://cl.ly/image/2p19360u2P43

I also checked that using prcomp on t(x) yielded a different rotation matrix, so prcomp is doing something fancy, but something mysterious under the hood. I know from here that prcomp is using SVD to calculate the principal component loading vectors instead of eigen decomposition, but that should not be leading to any errors here (or so I think...).

What is the correct way of using the built in prcomp method, there must be a way, right?

1
Due to "maths", the method can not solve your problem. You should have at least as many observations as variables, but preferably more of the former. - Roman Luštrik
Thanks for the comment @RomanLuštrik! This source corroborates what you said for the method princomp, but as I was surprised to find in the answer below, it is not the case that you need more observations than variables (this would be impossible in some disciplines like genomics, or even here with facial recognition). Rather the issue here is some annoying defaults in prcomp as well as a lack of normalization. Sorry for the lame problem in my question! - polpetti

1 Answers

1
votes

Wow, the answer is not a fun one at all, and rather has to do with default parameters in the prcomp method:

To solve this issue, first, I looked at the R source of prcomp and saw that the rotation matrix should equal svd(X)$v. Checking this on the R command line proved that with my X (data here) it did not. This is because even though there is the default param scale = F to prcomp, prcomp will still run the R scale method, if only to center the matrix, which is default True as seen here. In my case, this is bad because I passed the data as already centered (subtracted mean image).

So, rerunning with prcomp(X, center = F) will yield a rotation matrix equal to svd(X)$v as expected. From this point forward, the only "mistake" prcomp makes when constructing prcomp(X, center = F)$x will be to not normalize the columns, so they are each only off by a scalar multiple from the principal.component.scores matrix I reference above in my code. Without normalizing prcomp(X, center = F)$x the results are better, but not quite great as seen here: http://cl.ly/image/3u2y3m1h2S0o

But after normalizing via pca.x.norm <- apply(pca$x, 2, normalize.vector) the results of prcomp in reconstructing the face is identical: http://cl.ly/image/24390O3x0A0x

tl;dr - prcomp unexpectedly centers the data even with the param scale = F, plus for the purposes of eigenfaces you will need to normalize the columns of prcomp(X, center = F)$x, then everything will work as desired!