
I have been using http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.cross_val_score.html

to cross-validate a Logistic Regression classifier. The scores I got are:

[ 0.78571429  0.64285714  0.85714286  0.71428571  0.78571429
  0.64285714  0.84615385  0.53846154  0.76923077  0.66666667]

My primary question is: how can I find which set/fold maximises my classifier's score and produces the 0.857?

Follow-up question: Is training my classifier with this set a good practice?

Thank you in advance.

You seem to have a lot of questions here. I'd recommend picking the most important one and asking that. The other questions might get answered accidentally. :) – erip
Just in case, "Logistic Regression" is not a classifier per se. – Sergey Bushmanov

1 Answer


whether and how I could find which set/fold maximises my classifier's score

From the documentation of cross_val_score, you can see that it operates on a specific cv object. If you do not pass one explicitly, a default is chosen for you (StratifiedKFold for classifiers, KFold otherwise); refer to the documentation there for the details.

You can iterate over this object (or an identical one) to recover the exact train/test indices of each fold. E.g.:

from sklearn.cross_validation import KFold

for tr, te in KFold(10000, 3):
    # tr, te are the train/test index arrays of each fold - the same
    # folds that produced the scores you saw.
    pass
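To see which fold produced the 0.857, you can simply take the argmax over the score array. A minimal sketch, using the scores quoted in the question (the fold index below follows from those numbers, assuming cross_val_score returned them in fold order):

```python
import numpy as np

# Scores copied from the question, one per fold, in fold order.
scores = np.array([0.78571429, 0.64285714, 0.85714286, 0.71428571,
                   0.78571429, 0.64285714, 0.84615385, 0.53846154,
                   0.76923077, 0.66666667])

best_fold = int(np.argmax(scores))   # index of the highest-scoring fold
best_score = scores[best_fold]
print(best_fold, best_score)         # fold 2 scored ~0.857
```

That index then tells you which (tr, te) pair in the KFold iteration above corresponds to the best score.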

whether training my classifier with this set is a good practice.

Absolutely not!

The only legitimate uses of cross validation are for things like assessing overall performance, choosing between different models, or configuring model parameters.

Once you are committed to a model, you should train it over the entire training set. It is completely wrong to train it over the subset which happened to give the best score.
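A minimal sketch of that workflow (assumptions: newer scikit-learn versions, where cross_val_score lives in sklearn.model_selection rather than sklearn.cross_validation, and X, y here are synthetic stand-ins for your actual training data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

# Synthetic stand-in for your full training set.
rng = np.random.RandomState(0)
X = rng.randn(140, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()

# Use cross validation ONLY to estimate how well this model generalises.
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())

# Once you are committed to the model, fit it on ALL the training data.
clf.fit(X, y)
```

Note that the per-fold models built inside cross_val_score are thrown away; the final model is the one fitted on the entire training set.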