0
votes

I am currently trying to obtain equivalent results with the proc princomp command in SAS and the princomp() command in R (in the stats package). The results I am getting are very similar, leading me to suspect that this isn't a problem with different options settings in the two commands. However, the outpus are also different enough that the component scores for each data row are notably different. They are also sign-reversed, but this doesn't matter, of course.

The end goal of this analysis is to produce a set of coefficients from the PCA to score data outside the PCA routine (i.e. a formula that can be applied to new datasets to easily produce scored data).

Without posting all my data, I'm hoping someone can provide some information on how these two commands may differ in their calculations. I don't know enough about the PCA math to determine if this is a conceptual difference in the processes or just something like an internal rounding difference. For simplicity, I'll post the eigenvectors for PC1 and PC2 only.

In SAS:

proc princomp data=climate out=pc_out outstat=pc_outstat; 
var MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD; 
run;

returns

Eigenvectors
       Prin1  Prin2  Prin3  Prin4  Prin5  Prin6  Prin7  Prin8  Prin9 
MAT    0.372  0.257  -.035  -.033  -.106  0.270  -.036  0.216  -.811 
MWMT   0.381  0.077  0.160  -.261  0.627  0.137  -.054  0.497  0.302 
MCMT   0.341  0.324  -.229  0.046  -.544  0.421  0.045  0.059  0.493 
logMAP -.184  0.609  -.311  -.357  -.041  -.548  0.183  0.183  0.000 
logMSP -.205  0.506  0.747  -.137  -.040  0.159  -.156  -.266  0.033 
CMI    -.336  0.287  -.451  0.096  0.486  0.499  0.050  -.318  -.031 
cmiJJA -.365  0.179  0.112  0.688  -.019  0.012  0.015  0.588  0.018 
DD_5   0.379  0.142  0.173  0.368  0.183  -.173  0.725  -.282  0.007 
NFFD   0.363  0.242  -.136  0.402  0.158  -.351  -.637  -.264  0.052 

In R:

PCA.model <- princomp(climate[,c("MAT","MWMT","MCMT","logMAP","logMSP","CMI","cmiJJA","DD.5","NFFD")], scores=T, cor=T)
PCA.model$loadings

returns

Eigenvectors
       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
MAT    -0.372 -0.269         0.126        -0.250         0.270  0.789
MWMT   -0.387        -0.171         0.675                0.494 -0.325
MCMT   -0.339 -0.332  0.250  0.164 -0.500 -0.414               -0.510
logMAP  0.174 -0.604  0.309  0.252         0.619 -0.213  0.125       
logMSP  0.202 -0.501 -0.727  0.223        -0.162  0.175 -0.268       
CMI     0.334 -0.293  0.459 -0.222  0.471 -0.495        -0.271       
cmiJJA  0.365 -0.199 -0.174 -0.612 -0.247                0.590       
DD.5   -0.382 -0.143 -0.186 -0.421               -0.695 -0.360       
NFFD   -0.368 -0.227        -0.487         0.309  0.655 -0.205  

As you can see, the values are similar (sign reversed), but not identical. The differences matter in the scored data, the first row of which looks like this:

     Prin1  Prin2  Prin3  Prin4  Prin5  Prin6  Prin7  Prin8  Prin9 
SAS  -1.95   1.68  -0.54   0.72  -1.07   0.10  -0.66  -0.02   0.05
R     1.61  -1.99   0.52  -0.42  -1.13  -0.16   0.79   0.12  -0.09

If I use a GLM (in SAS) or lm() (in R) to calculate the coefficients from the scored data, I get very similar numbers (inverse sign), with the exception of the intercept. Like so:

in SAS:

proc glm order=data data=pc_out;
model Prin1 = MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;

in R:

scored <- cbind(PCA.model$scores, climate)
pca.lm <- lm(Comp.1~MAT+MWMT+MCMT+logMAP+logMSP+CMI+cmiJJA+DD.5+NFFD, data=scored)

returns

    Coefficients:
    (Int)  MAT    MWMT   MCMT   logMAP  logMSP  CMI     cmiJJA  DD.5     NFFD 
SAS  0.42   0.04   0.06   0.03  -0.65   -0.69   -0.003  -0.01    0.0002   0.004
R   -0.59  -0.04  -0.06  -0.03   0.62    0.68    0.004   0.02   -0.0002  -0.004

So it would seem that the model intercept is changing the value in the scored data. Any thoughts on why this happens (why the intercept is different) would be appreciated.

1
may be this is a question for stats.stackexchange.comAnanta
The fact that one result is -1 * the other result is meaningless. From ?prcomp: The signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA, and even between different builds of R.Bryan Hanson
Also, look at the help pages for prcomp vs princomp. They use different methods. One quick test is to change from princomp to 'prcomp` and see if the results match the SAS version. But the details on these are quite technical, you may have trouble ferreting it out.Bryan Hanson
@BryanHanson Thanks for you comments. Yes, I'm aware that the sign is irrelevant (as I stated). I have read the help files as well as other documentation and, as you say, am having trouble ferreting out the differences. That's precisely why I'm asking here. I have run the data with prcomp also (should have mentioned that) and (with scale.=T) it seems equivalent to princomp in R (same output). So my question still pertains to the difference between the R and SAS PCA routines.David Roberts
You may know this, but if you just type 'getAnywhere(prcomp.default)` or 'getAnywhere(princomp.default)` you'll see the code that drives these. But if you look up on the internet a sort of minimal set of steps to do PCA you'll see that both of these functions are highly optimized and very difficult to pick apart. That's why I think your quest is tricky.Bryan Hanson

1 Answers

3
votes

Thanks again to all those who commented. Embarrassingly, the differences I found between the SAS proc princomp and R princomp() procedures was actually a product of a data error that I made. Sorry to those who took time to help answer.

But rather than let this question go to waste, I will offer what I found to be statistically equivalent procedures for SAS and R when running a principal component analysis (PCA).

The following procedures are statistically equivalent, with data named 'mydata' and variables named 'Var1', 'Var2', and 'Var3'.

In SAS:

* Run the PCA on your data;
proc princomp data=mydata out=pc_out outstat=pc_outstat; 
var Var1 Var2 Var3; 
run;
* Use GLM on the individual components to obtain the coefficients to calculate the PCA scoring;
proc glm order=data data=pc_out;
model Prin1 = Var1 Var2 Var3;
run;

In R:

PCA.model <- princomp(mydata[,c("Var1","Var2","Var3")], scores=T, cor=T)
scored <- predict(PCA.model, mydata)
scored <- cbind(PCA.model$scores, mydata)
lm(Comp.1~Var1+Var2+Var3, data=scored)