I have been comparing the performance of several PCA implementations from both Python and R, and noticed an interesting behavior:
It seems impossible to compute the PCA of a sparse matrix in Python: the only candidate is scikit-learn's TruncatedSVD, and it does not support the mean-centering required to make it equivalent to a covariance-based PCA. The argument is that centering would destroy the sparsity of the matrix. Other implementations, like Facebook's PCA algorithm or scikit-learn's PCA/RandomizedPCA, do not support sparse matrices for similar reasons.
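To make the sparsity objection concrete, here is a minimal sketch (the matrix sizes and density are made up) of what explicit centering does to a sparse matrix in Python:

```python
import numpy as np
from scipy import sparse

# A tall, very sparse matrix, stored cheaply in CSR format.
A = sparse.random(10_000, 1_000, density=0.001, format="csr", random_state=0)
print(A.data.nbytes)      # ~0.08 MB of stored nonzero values

# Explicit mean-centering turns almost every zero into a nonzero,
# so the result is effectively dense.
col_means = np.asarray(A.mean(axis=0))
A_centered = A.toarray() - col_means
print(A_centered.nbytes)  # 10,000 * 1,000 * 8 bytes = 80 MB
```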
While all of that makes sense to me, several R packages, like irlba and rsvd, are able to handle sparse matrices (e.g. generated with rsparsematrix), and even offer dedicated centering arguments such as center=TRUE.
My question is how R handles this internally, as it seems to be vastly more efficient than the comparable Python implementations. Do these packages maintain sparsity by doing absolute scaling instead (which would technically falsify the results, but at least preserve sparsity)? Or is there some way to store the mean just once and apply it implicitly to the zero values, instead of materializing it in every entry?
To get this taken off hold: how does R internally handle mean-centered matrices without exploding RAM usage? Hope that is concise enough.
?irlba: "Use the optional ‘center’ parameter to implicitly subtract the values in the ‘center’ vector from each column of ‘A’, computing the truncated SVD of ‘sweep(A, 2, center, FUN=-)’, without explicitly forming the centered matrix" (emphasis added; in other words, it's an algorithmic trick rather than a storage trick). Then you have to look at the code: github.com/bwlewis/irlba/blob/master/R/irlba.R to see how the center argument is actually used within the algorithm. – Ben Bolker
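For what it's worth, the same algorithmic trick can be reproduced in Python; this is a sketch of the idea rather than irlba's actual code, and the names here are my own. Since (A - 1cᵀ)v = Av - (c·v)1, and a Lanczos-style solver only needs matrix-vector products, the centering can be folded into those products via scipy.sparse.linalg.LinearOperator and passed to svds:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

# Hypothetical sparse data matrix.
A = sparse.random(10_000, 500, density=0.01, format="csr", random_state=0)
c = np.asarray(A.mean(axis=0)).ravel()  # column means: the 'center' vector
ones = np.ones(A.shape[0])

def matvec(v):
    # (A - 1 c^T) v = A v - (c . v) 1   -- the centered matrix is never formed
    v = np.ravel(v)
    return A @ v - (c @ v) * ones

def rmatvec(u):
    # (A - 1 c^T)^T u = A^T u - (1 . u) c
    u = np.ravel(u)
    return A.T @ u - (ones @ u) * c

centered = LinearOperator(A.shape, matvec=matvec, rmatvec=rmatvec)

# Truncated SVD of the implicitly centered matrix; the right singular
# vectors then match those of a covariance-based PCA.
U, s, Vt = svds(centered, k=10)
```

Only the length-ncol vector of column means is ever stored, which is presumably why irlba's center argument costs almost no extra memory.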