0 votes

In scikit-learn, C is the inverse of the regularization strength (link). I manually ran three trainings with the same parameters and conditions, except that I used three different values of C (0.1, 1.0, and 10.0). I compared the F-scores on the validation set and identified the "best" C. However, someone told me this is wrong, as I am not supposed to use the validation set to optimize C. How should I pick the right C? And what justification do I have if I choose the default C (= 1.0) from scikit-learn?


1 Answer

1 vote

How should I pick the right C?

You are supposed to have a three-way split of your data: training, validation, and test sets. You fit the model on the training set, tune hyperparameters (such as C) on the validation set, and only at the end evaluate on the test set. In particular, when data is small, you can do this in a nested k-fold CV fashion: an outer CV loop produces the train-test splits, and an inner CV loop splits each training portion further into actual training and validation folds.
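A minimal sketch of this nested scheme with scikit-learn (the synthetic dataset, the LogisticRegression estimator, and the grid of C values are illustrative assumptions; swap in your own data and classifier):

```python
# Sketch of nested cross-validation: the inner loop selects C on
# validation folds, the outer loop evaluates that whole selection
# procedure on held-out test folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Toy binary classification data (replace with your own X, y).
X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: for each candidate C, fit on the training folds and
# score (here with F1, as in the question) on the validation fold.
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=3,
)

# Outer loop: each outer test fold is never seen during C selection,
# so the resulting scores are an unbiased estimate of performance.
scores = cross_val_score(inner, X, y, scoring="f1", cv=5)
print(scores.mean())
```

Note that this reports an estimate of the tuned model's performance; to obtain a final model, you would refit the inner search on all the data and use its `best_params_`.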

And what justification do I have if I choose the default C (= 1.0) from scikit-learn?

There is no justification beyond it placing an arbitrary prior on the weights (so any other value would be equally well justified).