I am currently trying to obtain equivalent results with the proc princomp command in SAS and the princomp() command in R (in the stats package). The results I am getting are very similar, leading me to suspect that this isn't a problem with different options settings in the two commands. However, the outpus are also different enough that the component scores for each data row are notably different. They are also sign-reversed, but this doesn't matter, of course.
The end goal of this analysis is to produce a set of coefficients from the PCA to score data outside the PCA routine (i.e. a formula that can be applied to new datasets to easily produce scored data).
Without posting all my data, I'm hoping someone can provide some information on how these two commands may differ in their calculations. I don't know enough about the PCA math to determine if this is a conceptual difference in the processes or just something like an internal rounding difference. For simplicity, I'll post the eigenvectors for PC1 and PC2 only.
In SAS:
proc princomp data=climate out=pc_out outstat=pc_outstat;
var MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
returns
Eigenvectors
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
MAT 0.372 0.257 -.035 -.033 -.106 0.270 -.036 0.216 -.811
MWMT 0.381 0.077 0.160 -.261 0.627 0.137 -.054 0.497 0.302
MCMT 0.341 0.324 -.229 0.046 -.544 0.421 0.045 0.059 0.493
logMAP -.184 0.609 -.311 -.357 -.041 -.548 0.183 0.183 0.000
logMSP -.205 0.506 0.747 -.137 -.040 0.159 -.156 -.266 0.033
CMI -.336 0.287 -.451 0.096 0.486 0.499 0.050 -.318 -.031
cmiJJA -.365 0.179 0.112 0.688 -.019 0.012 0.015 0.588 0.018
DD_5 0.379 0.142 0.173 0.368 0.183 -.173 0.725 -.282 0.007
NFFD 0.363 0.242 -.136 0.402 0.158 -.351 -.637 -.264 0.052
In R:
PCA.model <- princomp(climate[,c("MAT","MWMT","MCMT","logMAP","logMSP","CMI","cmiJJA","DD.5","NFFD")], scores=T, cor=T)
PCA.model$loadings
returns
Eigenvectors
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
MAT -0.372 -0.269 0.126 -0.250 0.270 0.789
MWMT -0.387 -0.171 0.675 0.494 -0.325
MCMT -0.339 -0.332 0.250 0.164 -0.500 -0.414 -0.510
logMAP 0.174 -0.604 0.309 0.252 0.619 -0.213 0.125
logMSP 0.202 -0.501 -0.727 0.223 -0.162 0.175 -0.268
CMI 0.334 -0.293 0.459 -0.222 0.471 -0.495 -0.271
cmiJJA 0.365 -0.199 -0.174 -0.612 -0.247 0.590
DD.5 -0.382 -0.143 -0.186 -0.421 -0.695 -0.360
NFFD -0.368 -0.227 -0.487 0.309 0.655 -0.205
As you can see, the values are similar (sign reversed), but not identical. The differences matter in the scored data, the first row of which looks like this:
Prin1 Prin2 Prin3 Prin4 Prin5 Prin6 Prin7 Prin8 Prin9
SAS -1.95 1.68 -0.54 0.72 -1.07 0.10 -0.66 -0.02 0.05
R 1.61 -1.99 0.52 -0.42 -1.13 -0.16 0.79 0.12 -0.09
If I use a GLM (in SAS) or lm() (in R) to calculate the coefficients from the scored data, I get very similar numbers (inverse sign), with the exception of the intercept. Like so:
in SAS:
proc glm order=data data=pc_out;
model Prin1 = MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD_5 NFFD;
run;
in R:
scored <- cbind(PCA.model$scores, climate)
pca.lm <- lm(Comp.1~MAT+MWMT+MCMT+logMAP+logMSP+CMI+cmiJJA+DD.5+NFFD, data=scored)
returns
Coefficients:
(Int) MAT MWMT MCMT logMAP logMSP CMI cmiJJA DD.5 NFFD
SAS 0.42 0.04 0.06 0.03 -0.65 -0.69 -0.003 -0.01 0.0002 0.004
R -0.59 -0.04 -0.06 -0.03 0.62 0.68 0.004 0.02 -0.0002 -0.004
So it would seem that the model intercept is changing the value in the scored data. Any thoughts on why this happens (why the intercept is different) would be appreciated.
?prcomp
: The signs of the columns of the rotation matrix are arbitrary, and so may differ between different programs for PCA, and even between different builds of R. – Bryan Hansonprcomp
vsprincomp
. They use different methods. One quick test is to change fromprincomp
to 'prcomp` and see if the results match the SAS version. But the details on these are quite technical, you may have trouble ferreting it out. – Bryan Hanson