28 votes

I have been comparing the performance of several PCA implementations from both Python and R, and noticed an interesting behavior:
It seems impossible to compute the PCA of a sparse matrix in Python: the only candidate would be scikit-learn's TruncatedSVD, but it does not support the mean-centering required to make it equivalent to a covariance-based PCA. Their reasoning is that centering would destroy the sparsity of the matrix. Other implementations, such as Facebook's PCA algorithm or the PCA/RandomizedPCA classes in scikit-learn, do not support sparse matrices for similar reasons.

All of that makes sense to me, yet several R packages, like irlba, rsvd, etc., are able to handle sparse matrices (e.g. generated with rsparsematrix) and even accept an explicit center=TRUE argument.

My question is how R handles this internally, since it seems to be vastly more efficient than the comparable Python implementations. Does R preserve the sparsity by doing absolute scaling instead (which would theoretically falsify the results, but at least keep the matrix sparse)? Or is there a way to store the mean for the zero values explicitly but only once, instead of once for every entry?

To get the question taken off hold: how does R internally store mean-centered matrices without exploding RAM usage? Hope that is concise enough.

This is an interesting question, but I'm not 100% sure SO is the best place to ask it. You might consider asking on Cross Validated, where I think you're more likely to get an answer. – piman314
Thanks for the hint. I was considering SO, since it might be tagged as off topic on Cross Validated. Maybe I will ask there, too, if it remains unanswered. – dennlinger
I think the answer will be found in ?irlba: "Use the optional ‘center’ parameter to implicitly subtract the values in the ‘center’ vector from each column of ‘A’, computing the truncated SVD of ‘sweep(A, 2, center, FUN=-)’, without explicitly forming the centered matrix" (emphasis added; in other words, it's an algorithmic trick rather than a storage trick). Then you have to look at the code: github.com/bwlewis/irlba/blob/master/R/irlba.R to see how the center argument is actually used within the algorithm. – Ben Bolker
Maybe you can have a look at this. – rpanai
Thanks for the link, but I am not entirely sure how this is supposed to help? Sparse matrices are not even mentioned in the article, and the code is purely Python-based... I already know that Python does not support sparse handling (at least not the "efficient" packages of scikit-learn). – dennlinger

1 Answer

3 votes

The key here is that the underlying implementation of the partial SVD (restarted Lanczos bidiagonalization, in C code) doesn't need to store the centered matrix. Instead, it only records the result of applying the matrix, as a linear operator, to a small set of vectors obtained from the previous iteration.

Rather than explaining the concrete method used in the C code, which is quite advanced (see the paper for a description), I will illustrate it with a much simpler algorithm that captures the key idea of how the efficiency of sparsity is preserved: the power method (or subspace iteration, its generalization to multiple eigenvalues). The power method finds the largest eigenvalue of a matrix A by iteratively applying it as a linear operator and then normalizing (or orthogonalizing a small set of vectors, in the case of subspace iteration).

What you do at every iteration is

v=A*v
v=v/norm(v)
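
For concreteness, here is a minimal Python/SciPy sketch of this plain power iteration (the matrix, sizes and names are made up for illustration; this is not the irlba/Lanczos code):

import numpy as np
import scipy.sparse as sp

def power_method(A, n_iter=100, seed=0):
    # Largest-magnitude eigenpair of a square matrix A. A only needs to
    # support the matrix-vector product, so a sparse A stays sparse.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = A @ v                  # sparse mat-vec, cost ~ number of nonzeros
        v /= np.linalg.norm(v)
    return v @ (A @ v), v          # Rayleigh quotient and eigenvector

# hypothetical example: a random sparse symmetric matrix
X = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
A = (X + X.T) / 2
lam, v = power_method(A)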

The matrix multiplication step is the crucial one, so let's see what happens when we try the same thing with a centered A. The formula for the centered matrix (with center as the vector of column means and ones as the vector of all ones) is:

A_center=A-ones*transpose(center)
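
Forming A_center explicitly would turn every zero entry into a nonzero one, which is exactly the RAM blow-up the question worries about. A small hypothetical check (Python/SciPy, arbitrary sizes):

import numpy as np
import scipy.sparse as sp

X = sp.random(10000, 500, density=0.01, format="csr", random_state=0)
center = np.asarray(X.mean(axis=0)).ravel()   # column means

print(X.data.nbytes)             # roughly 0.4 MB of stored nonzero values
Xc = X.toarray() - center        # explicit centering forces a dense matrix
print(Xc.nbytes)                 # roughly 40 MB: every zero became nonzero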

So if we instead apply the iterative algorithm to this new matrix, without ever forming it explicitly, we would get

v=A*v-dotproduct(center,v)*ones

Since A is still sparse, we can use the sparse matrix-vector product for A*v, and the term dotproduct(center,v)*ones just subtracts a single scalar (the dot product of center and v) from every entry of the resulting vector, which is linear in the dimension of A. The centering is therefore applied implicitly, and the original sparse matrix is never modified.
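
The same trick carries over to the truncated-SVD/PCA setting the question is about: wrap the implicitly centered product in a matrix-free linear operator and hand it to an iterative SVD solver. A hedged Python/SciPy sketch (using scipy.sparse.linalg.LinearOperator and svds as stand-ins; this illustrates the idea, not what irlba or scikit-learn actually do internally):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

X = sp.random(10000, 500, density=0.01, format="csr", random_state=0)
center = np.asarray(X.mean(axis=0)).ravel()   # column means
ones = np.ones(X.shape[0])

def matvec(v):
    # (X - ones*center^T) @ v  ==  X @ v - dot(center, v) * ones
    v = np.ravel(v)
    return X @ v - (center @ v) * ones

def rmatvec(u):
    # (X - ones*center^T)^T @ u  ==  X^T @ u - dot(ones, u) * center
    u = np.ravel(u)
    return X.T @ u - (ones @ u) * center

Xc = LinearOperator(X.shape, matvec=matvec, rmatvec=rmatvec)

# truncated SVD of the implicitly centered matrix, i.e. the PCA of X,
# while X itself stays sparse in memory the whole time
U, s, Vt = svds(Xc, k=10)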