
I've been using Proc GlmSelect and the cross validation feature, because I have a fairly small sample size.

I pick the model with the lowest CVPRESS (the predicted residual sum of squares from cross validation). The output produces "final" Parameter Estimates for all variables, as well as Parameter Estimates for each variable in each cross validation fold.

However, the "final" parameter estimate is not equal to the average of the per-fold estimates, nor to a weighted average where I weight by either the size of the test set or the validation set.

I've looked through a lot of SAS instructions, but I'm unable to find any explanation of how the final parameter estimates are derived from the different cross validations.

Would be very thankful for an answer or something that would point me in the right direction.

Br,


1 Answer


Your question points more to the nature of cross-validation than to PROC GLMSELECT, I think. The "final" estimates are not a simple combination (such as an average) of the estimates from the models fitted during cross-validation; there is no such relationship between them.

This is why: during CV, you fit separate models, each with a different fold held out (i.e. each model is fitted on a different subset of the data), and the estimates are the optimal "solution" for that subset (details here). The "final" fit is estimated on the entire sample, I assume. Differences in the training data do lead to differences in the estimates, but you can't expect the "final" estimates to be derivable from the CV estimates by simple averaging: the estimates are a non-linear and often complicated function of the training data.
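To make this concrete, here is a minimal sketch in plain Python (not SAS, and not what PROC GLMSELECT does internally). It assumes a simple one-predictor least-squares fit, made-up synthetic data, and a hypothetical interleaved 5-fold split, and compares the full-sample fit with the plain average of the per-fold fits:

```python
import random

def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Synthetic data, assumed purely for illustration.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]

k = 5
n = len(xs)

# Fit one model per fold, each time leaving that fold out of the training data.
fold_fits = []
for fold in range(k):
    held_out = set(range(fold, n, k))  # interleaved fold assignment (an assumption)
    train = [i for i in range(n) if i not in held_out]
    fold_fits.append(ols_fit([xs[i] for i in train], [ys[i] for i in train]))

# Fit on the entire sample, and take the plain average of the fold fits.
full_fit = ols_fit(xs, ys)
avg_fit = (
    sum(f[0] for f in fold_fits) / k,
    sum(f[1] for f in fold_fits) / k,
)

print("full-sample (intercept, slope):", full_fit)
print("mean of CV  (intercept, slope):", avg_fit)
```

The two printed pairs should be close but not identical, because each fold fit depends on its own training subset through the ratio `sxy / sxx`, and an average of ratios is generally not the ratio computed on the pooled data.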

My suggestion: use the CV fits to see the distribution of the coefficients, compare the final estimates with them, and examine the performance of each CV model. This will help you validate your model and its selection.
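As a rough sketch of that check, again in plain Python with assumed synthetic data and a hypothetical interleaved 5-fold split (in practice you would read the per-fold coefficients from the GLMSELECT output instead), one could tabulate the per-fold slope, the held-out squared error of each fold, and the full-sample slope side by side:

```python
import random
import statistics

def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Synthetic data, assumed purely for illustration.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 3) for x in xs]

k = 5
n = len(xs)
rows = []  # one (fold, intercept, slope, held-out squared error) tuple per fold
for fold in range(k):
    held_out = set(range(fold, n, k))  # interleaved fold assignment (an assumption)
    train = [i for i in range(n) if i not in held_out]
    b0, b1 = ols_fit([xs[i] for i in train], [ys[i] for i in train])
    # Squared prediction error on the observations this model never saw.
    press = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out)
    rows.append((fold, b0, b1, press))

full_b0, full_b1 = ols_fit(xs, ys)
slopes = [r[2] for r in rows]
cv_press = sum(r[3] for r in rows)  # summed held-out error, analogous to CVPRESS

for fold, b0, b1, press in rows:
    print(f"fold {fold}: intercept={b0:.3f} slope={b1:.3f} held-out SSE={press:.2f}")
print(f"slope across folds: mean={statistics.mean(slopes):.3f} "
      f"sd={statistics.stdev(slopes):.3f}")
print(f"full-sample slope : {full_b1:.3f}")
print(f"summed held-out SSE: {cv_press:.2f}")
```

A fold whose coefficients or held-out error sit far from the others is a hint that the fit is sensitive to a few observations, which matters with a small sample.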