3
votes

Picking up from where we left off...

So I can use linalg.eig or linalg.svd to compute the PCA. Each one returns different principal components/eigenvectors and eigenvalues when fed the same data (I'm currently using the Iris dataset).

Looking here, or at any other tutorial that applies PCA to the Iris dataset, I find that the eigenvalues are [2.9108 0.9212 0.1474 0.0206]. The eig method gives me a different set of eigenvalues/eigenvectors to work with, which I don't mind, except that the tutorial eigenvalues sum to the number of dimensions (4) and can be used to find how much each component contributes to the total variance.

With the eigenvalues returned by linalg.eig I can't do that. For example, the values returned are [9206.53059607 314.10307292 12.03601935 3.53031167], so the proportion of variance in this case would be [0.96542969 0.03293797 0.00126214 0.0003702]. This other page says that "The proportion of the variation explained by a component is just its eigenvalue divided by the sum of the eigenvalues."

Since the proportion of variance explained by each component should be the same regardless of the method (I think), these proportions are wrong. So, if I use the values returned by svd(), which are the values used in all tutorials, I can get the correct percentage of variation from each dimension, but I'm wondering why the values returned by eig can't be used like that.

I assume the results returned are still a valid way to project the variables, so is there a way to transform them so that I get the correct proportion of variance explained by each component? In other words, can I use the eig method and still have the proportion of variance for each component? Additionally, could this mapping be applied only to the eigenvalues, so that I can have both the raw eigenvalues and the normalized ones?
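To make the computation concrete, here is a minimal sketch of what I mean by "proportion of variance", using the eigenvalues quoted above (the variable names are just mine):

import numpy as np

# eigenvalues as returned by numpy.linalg.eig in my run (copied from above)
w = np.array([9206.53059607, 314.10307292, 12.03601935, 3.53031167])

# proportion of variance: each eigenvalue divided by the sum of all eigenvalues
print(w / w.sum())   # -> [0.96542969 0.03293797 0.00126214 0.0003702 ]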

Sorry for the long writeup btw. Here's a (::) for having gotten this far. Assuming you didn't just read this line.

You might want to post this on math.stackexchange.com – S.Lott
@S.Lott I tried posting there once and they said the site was only for real advanced math and stuff, so I'd rather not go there again unless I really have to. – pcapcapcapca
Since this isn't programming (no code), you are less likely to get help here. A single question with snarky comments doesn't mean much. I'd start a search there for related questions before waiting around for an answer here. It's important to read a lot of questions to see how the questions are asked so you can fit in better. Example: "Sorry for the long writeup" is lame. If it's long, you can fix it to be short and to the point. – S.Lott
@S.Lott my question is somewhat specific to Python. I can ask there (I'm typing the question in the other tab) but I'm pretty confident they'll just tell me they can't help because of that. Plus, I wrote a lot of text because often you get people asking you to clarify what you mean, if the other questions I read are any indication. I'm pretty sure you're mad because you just read the last line and got no cookie. – pcapcapcapca
"i'm pretty sure you're mad". You'd be very wrong, then.S.Lott

4 Answers

4
votes

Taking Doug's answer to your previous question and implementing the following two functions, I get the output shown below:

import numpy as np

def pca_eig(orig_data):
    # standardize the data, then eigendecompose its correlation matrix
    data = np.array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    C = np.corrcoef(data, rowvar=False)
    w, v = np.linalg.eig(C)
    print("Using numpy.linalg.eig")
    print(w)
    print(v)

def pca_svd(orig_data):
    # standardize the data, then take the SVD of its correlation matrix
    data = np.array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    C = np.corrcoef(data, rowvar=False)
    u, s, v = np.linalg.svd(C)
    print("Using numpy.linalg.svd")
    print(u)
    print(s)
    print(v)

Output:

Using numpy.linalg.eig
[ 2.91081808  0.92122093  0.14735328  0.02060771]
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]

Using numpy.linalg.svd
[[-0.52237162 -0.37231836  0.72101681  0.26199559]
 [ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
 [-0.58125401 -0.02109478 -0.14089226 -0.80115427]
 [-0.56561105 -0.06541577 -0.6338014   0.52354627]]
[ 2.91081808  0.92122093  0.14735328  0.02060771]
[[-0.52237162  0.26335492 -0.58125401 -0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]
 [ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
 [ 0.26199559 -0.12413481 -0.80115427  0.52354627]]

In both cases, I get the desired eigenvalues.
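For completeness, this is roughly how the two functions can be driven; loading the Iris measurements through scikit-learn is just my assumption here, any (150, 4) array of the raw data will do:

from sklearn.datasets import load_iris

iris = load_iris()      # iris.data is the raw (150, 4) array of measurements
pca_eig(iris.data)      # eigendecomposition of the correlation matrix
pca_svd(iris.data)      # SVD of the same correlation matrix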

0
votes

Are you sure the data for both cases are the same and in the correct order of dimensions (you're not sending in the rotated array, are you)? I bet you'll find they both give the same results if you use them right ;)

0
votes

There are three ways I know of to do PCA: from an eigenvalue decomposition of the correlation matrix, of the covariance matrix, or of the unscaled and uncentered data. It sounds like you are passing the unscaled, uncentered data to linalg.eig. Anyway, that is just a guess. A better place for your question is stats.stackexchange.com. The folks on math.stackexchange.com don't use actual numbers. :)
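A minimal sketch of how the three options differ on the same data (loading Iris through scikit-learn is my own choice here, not something from the question):

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                    # raw (150, 4) Iris measurements

def top_eigvals(M):
    # eigenvalues of a symmetric matrix, largest first
    return np.sort(np.linalg.eigvalsh(M))[::-1]

print(top_eigvals(np.corrcoef(X, rowvar=False)))   # correlation matrix: eigenvalues sum to 4
print(top_eigvals(np.cov(X, rowvar=False)))        # covariance matrix: eigenvalues sum to the total variance
print(top_eigvals(X.T @ X))                        # unscaled, uncentered data: values in the thousands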

0
votes

I'd suggest using SVD, singular value decomposition, for PCA, because
1) it directly gives you the values and matrices you need
2) it's robust.
See principal-component-analysis-in-python on SO for an example with (surprise) iris data. Running it gives

read iris.csv: (150, 4)
Center -= A.mean: [ 5.84  3.05  3.76  1.2 ]
Center /= A.std: [ 0.83  0.43  1.76  0.76]

SVD: A (150, 4) -> U (150, 4)  x  d diagonal  x  Vt (4, 4)
d^2: 437 138 22.1 3.09
% variance: [  72.77   95.8    99.48  100.  ]
PC 0 weights: [ 0.52 -0.26  0.58  0.57]
PC 1 weights: [-0.37 -0.93 -0.02 -0.07]

You see that the diagonal values d from the SVD, squared and divided by their sum, give the proportion of total variance accounted for by PC 0, PC 1, ... (the "% variance" line above shows this cumulatively).
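To spell that out (a small sketch using the d^2 numbers printed above):

import numpy as np

d2 = np.array([437.0, 138.0, 22.1, 3.09])   # the d^2 values printed above
print(np.cumsum(d2) / d2.sum() * 100)        # -> roughly [72.77 95.8 99.48 100.], the "% variance" line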

Does this help?