How to get the best coefficient vector using cross-validation

Question

I am running ridge regression on a dataset with 5-fold cross-validation, so my dataset is divided into 5 train/test splits.

This is how I did it in scikit-learn:

from sklearn import cross_validation
k_fold = cross_validation.KFold(n=len(tourism_train_X), n_folds=5)

I set the regularisation parameter like this:

import numpy as np

# Generate candidate alpha values for the regularisation parameter
n_alphas = 200
alphas = np.logspace(-10, -1, n_alphas)

Now, my question is this. For each train and test fold, I do something like the following:

from sklearn import linear_model

ridge_tourism = linear_model.Ridge()
coefs = []  # one coefficient vector per (alpha, fold) pair
for a in alphas:
    ridge_tourism.set_params(alpha=a)
    for train_indices, test_indices in k_fold:
        # Fit the model on the training part of this fold
        ridge_tourism.fit(tourism_train_X[train_indices], tourism_train_Y[train_indices])
        coefs.append(ridge_tourism.coef_)

The problem is that this gives me a coefficient vector for each of the five training folds within each alpha. All I want is the best coefficient vector for each alpha. How do we get that? Out of the 5 training sets, how do we choose which coefficient vector is finally reported for that alpha?

What do you mean by "best coefficient vector for each alpha"? — dukebody

pyan pyan · Accepted Answer · 2015-05-05T15:50:28

For each alpha value, take the mean of the validation errors over the 5 folds. This gives you a curve of mean validation error vs. alpha. Choose the alpha value that gives the lowest mean validation error.
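
The procedure above could be sketched roughly as follows. This is only an illustration under several assumptions: the data is a synthetic stand-in for `tourism_train_X` / `tourism_train_Y`, mean squared error is used as the validation error, and the newer `sklearn.model_selection.KFold` API is used instead of the older `cross_validation` module from the question. The last step, refitting on the full training set with the chosen alpha to get a single coefficient vector, is a common convention and not something the answer states.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for tourism_train_X / tourism_train_Y (assumption).
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + 0.1 * rng.randn(100)

alphas = np.logspace(-10, -1, 200)
k_fold = KFold(n_splits=5)

# Mean validation error over the 5 folds, one value per alpha.
mean_val_errors = []
for a in alphas:
    fold_errors = []
    for train_idx, val_idx in k_fold.split(X):
        model = Ridge(alpha=a)
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        fold_errors.append(mean_squared_error(y[val_idx], predictions))
    mean_val_errors.append(np.mean(fold_errors))

# Alpha with the lowest mean validation error.
best_alpha = alphas[int(np.argmin(mean_val_errors))]

# Assumed final step: refit on all training data with the chosen alpha
# and report that model's coefficients.
final_model = Ridge(alpha=best_alpha).fit(X, y)
best_coef = final_model.coef_
```

With this approach the per-fold coefficient vectors are never compared directly; the folds are only used to score each alpha. scikit-learn also provides `RidgeCV`, which accepts an `alphas` grid and exposes the selected `alpha_` and fitted `coef_`, and may be a shorter route to the same result.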
