
Intro: taking a model fit on one data set and applying it to another data set to find an RMSE.

Say I have a dataset "data100"

and run the following forward-selection step to determine the significant variables:

PROC REG DATA=data100;
model y= x0-x999 / selection=forward SLENTRY=.01;
run;quit;

It returns that x0 x10 x20 x30 x40 x50 x60 x70 x80 x90 are significant at p < .0001. OK. Now I want to use this model on another data set, "data1000".

Why couldn't I then just use:

PROC REG DATA=data1000;
model y= x0 x10 x20 x30 x40 x50 x60 x70 x80 x90;
run;quit;

To determine the RMSE of the data1000 set?


The reason this came up is that a mentor told me to use:

proc reg data=data100 outest=data100est;
model y= x0-x999;
run;quit;

/* apply the saved data100 betas to data1000; with the RESIDUAL
   option, the output variable MODEL1 holds residuals, not scores */
proc score data=data1000 score=data100est out=data1000p residual type=parms;
var y x0-x999;
run;

/* USS of MODEL1 = the sum of squared residuals */
proc univariate data=data1000p;
var model1;
output out=data1000stat uss=ss1;
run;

/* RMSE = sqrt(SSE/n), with n = 1000 observations in data1000 */
data data1000stat;
set data1000stat;
rmse=sqrt(ss1/1000);
run;

proc print data=data1000stat;
run;quit;
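
Side note: I believe those last three steps could be collapsed into a single query. A minimal sketch, assuming the residuals really do land in MODEL1, and using COUNT so n isn't hardcoded to 1000:

/* USS is also a PROC SQL summary function, so SSE and n can be
   computed in one pass over the scored data */
proc sql;
   create table data1000stat as
   select sqrt(uss(model1) / count(model1)) as rmse
   from data1000p;
quit;

proc print data=data1000stat;
run;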

I'm very confused on this point. If anyone can clarify why, or even whether PROC SCORE is appropriate here, that would be great.

Definitely flag to migrate to Cross Validated - this is a question for a statistician, not a programmer. - Joe

1 Answer


This is probably better asked on the Stats forum. But since you asked...

When you run that second PROC REG, you are refitting the model: the estimated betas will be different from the betas you got from the first PROC REG. Because you are re-running the regression on data1000, you are by definition getting the MINIMUM RMSE for those data.
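
To see this concretely, here is a minimal sketch (data set and variable names taken from your question; est100 and est1000 are names I made up) that fits the selected model on both data sets and compares the betas:

/* fit the same selected model on each data set, keeping only the
   coefficient estimates */
proc reg data=data100 outest=est100 noprint;
   model y= x0 x10 x20 x30 x40 x50 x60 x70 x80 x90;
run; quit;

proc reg data=data1000 outest=est1000 noprint;
   model y= x0 x10 x20 x30 x40 x50 x60 x70 x80 x90;
run; quit;

/* PROC COMPARE will show the two sets of betas differ, which is
   exactly why the refit's RMSE is not an out-of-sample measure */
proc compare base=est100 compare=est1000;
run;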

The second method keeps the betas from the first regression and applies them, unchanged, to the second data set. The RMSE you calculate there shows how well the model fit on data100 predicts data1000.
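
Incidentally, your mentor's version keeps the betas of the full x0-x999 model. If you want to carry forward only the forward-selection model, a sketch of the same pipeline restricted to the selected variables (selest and scored1000 are hypothetical names) would be:

/* keep the betas of the selected model fit on data100 ... */
proc reg data=data100 outest=selest noprint;
   model y= x0 x10 x20 x30 x40 x50 x60 x70 x80 x90;
run; quit;

/* ... and apply them, unchanged, to data1000; MODEL1 again holds
   the residuals because of the RESIDUAL option */
proc score data=data1000 score=selest out=scored1000 residual type=parms;
   var y x0 x10 x20 x30 x40 x50 x60 x70 x80 x90;
run;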

In the end, both are informative: the difference between the two RMSEs shows how much the fit degrades when the model built on data100 is asked to predict data1000.
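
A sketch of that side-by-side comparison (refit_est and rmse_compare are names I made up; data1000stat is the RMSE table from your pipeline). The OUTEST= data set from a refit already carries the in-sample Root MSE in the automatic _RMSE_ variable, so the benchmark needs no extra computation:

/* in-sample fit on data1000: _RMSE_ in the OUTEST data set is the
   minimum-RMSE benchmark for these data */
proc reg data=data1000 outest=refit_est noprint;
   model y= x0 x10 x20 x30 x40 x50 x60 x70 x80 x90;
run; quit;

/* put the refit (in-sample) RMSE next to the scored (out-of-sample)
   RMSE computed from the PROC SCORE residuals */
data rmse_compare;
   merge refit_est(keep=_rmse_ rename=(_rmse_=insample_rmse))
         data1000stat;
run;

proc print data=rmse_compare;
run;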