Cross validation and model selection

Question

I am using sklearn to train an SVM, and I am using cross-validation to evaluate the estimator and avoid overfitting.

I split the data into two parts, train data and test data. Here is the code:

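import numpy as np
from sklearn import datasets
# model_selection replaces the cross_validation module, which was removed in scikit-learn 0.20
from sklearn import model_selection
from sklearn import svm

# Load the iris dataset used below
iris = datasets.load_iris()

# Hold out 40% of the data as a test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
# 5-fold cross-validation on the training portion only
scores = model_selection.cross_val_score(clf, X_train, y_train, cv=5)
print(scores)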

Now I need to evaluate the estimator clf on X_test.

clf.score(X_test, y_test)

Here I get an error saying that the model has not been fitted with fit(), but doesn't cross_val_score normally fit the model? What is the problem?

When doing cross-validation you would train your model on X_train, y_train, then evaluate its performance on X_test, y_test. It wouldn't make sense to evaluate the performance of your classifier without training it first. — ali_m
@ali_m, what does cross_val_score() do? Normally it trains the model first. I understand what you are saying. In my case, I need a kind of early stopping to avoid overfitting: I split the dataset into three parts (train, valid, test), train the model on the train part, and then tune it on the valid part. Once I obtain reasonable train and valid errors, I test it on the test part. That is it! — Jeanne

ali_m · Accepted Answer · 2016-02-16T09:58:49

cross_val_score is basically a convenience wrapper for the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset, and it automatically performs one or more rounds of cross-validation by splitting your data into training/validation folds, fitting on the training part of each fold, and computing the score on the validation part. See the scikit-learn documentation on cross-validation for an example and more explanation.
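As a rough sketch of what that wrapper does, the loop below reproduces the scores by hand. It assumes a recent scikit-learn (the model_selection module) and assumes that an integer cv with a classifier means unshuffled StratifiedKFold, which is the documented default; the names fold_clf and manual_scores are only illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)

# Manual version: one fit and one score per fold
manual_scores = []
for fit_idx, val_idx in StratifiedKFold(n_splits=5).split(X_train, y_train):
    fold_clf = clone(clf)  # fresh, unfitted copy; clf itself is never fitted
    fold_clf.fit(X_train[fit_idx], y_train[fit_idx])
    manual_scores.append(fold_clf.score(X_train[val_idx], y_train[val_idx]))
manual_scores = np.array(manual_scores)

# The convenience wrapper does the same splitting, fitting and scoring internally
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(manual_scores)
print(scores)
```

Note that X_test is never touched here: every fold is carved out of X_train only.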

clf.score(X_test, y_test) raises an exception because cross_val_score performs the fitting on a copy of the estimator rather than on the original (see the use of clone(estimator) in the source code). Because of this, clf remains unchanged outside of the function call, so it is still unfitted when you call clf.score. To evaluate on the test set, fit clf on X_train, y_train yourself first.
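A minimal sketch of the usual workflow, assuming a recent scikit-learn where the helpers live in model_selection and where scoring an unfitted estimator raises NotFittedError: use cross_val_score for the validation estimate, then fit the original estimator explicitly before scoring it on the held-out test set.

```python
from sklearn import datasets, svm
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)

# Cross-validation fits clones of clf, so clf is still unfitted afterwards
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

try:
    clf.score(X_test, y_test)
    raised = False
except NotFittedError:
    raised = True  # the error described in the question

# Fit the original estimator on the full training set, then evaluate once on the test set
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

The cross-validation scores guide model selection (for example, choosing C or the kernel), while the single test-set score is kept for the final estimate.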
