This post is about the differences between LogisticRegressionCV, GridSearchCV and cross_val_score. Consider the following setup:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import (
    train_test_split, GridSearchCV, StratifiedKFold, cross_val_score)
from sklearn.metrics import confusion_matrix

# Load the digits data and hold out 30% of it as a test set
read = load_digits()
X, y = read.data, read.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In penalized logistic regression, we need to set the parameter C which controls regularization. There are 3 ways in scikit-learn to find the best C by cross validation.

LogisticRegressionCV

# Cs=10 searches 10 values of C on a log scale between 1e-4 and 1e4
clf = LogisticRegressionCV(Cs=10, penalty="l1",
                           solver="saga", scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))

Side note: The documentation states that SAGA and LIBLINEAR are the only solvers that support the L1 penalty, and that SAGA is faster for large datasets. Warm starting is not supported by LIBLINEAR; the LogisticRegression documentation lists LBFGS, Newton-CG, SAG and SAGA as the solvers that support warm_start.

GridSearchCV

clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True)
clf = GridSearchCV(clf, param_grid={"C": np.logspace(-4, 4, 10)},
                   scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
result = clf.cv_results_

cross_val_score

# Mean cross-validated score for each candidate C
cv_scores = {}
for val in np.logspace(-4, 4, 10):
    clf = LogisticRegression(C=val, penalty="l1",
                             solver="saga", warm_start=True)
    cv_scores[val] = cross_val_score(clf, X_train, y_train,
                                     cv=StratifiedKFold(),
                                     scoring="f1_macro").mean()

# Refit on the full training set with the best-scoring C
clf = LogisticRegression(C=max(cv_scores, key=cv_scores.get),
                         penalty="l1", solver="saga", warm_start=True)
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))

Questions

  1. Have I performed cross validation correctly in 3 ways?
  2. Are all 3 ways equivalent? If not, can they be made equivalent by changing the code?
  3. Which way is the best in terms of elegance, speed or any criteria? (In other words, why are there 3 ways of cross validation in scikit-learn?)

Non-trivial answers to any one question are welcome; I realize they are a bit long but they are hopefully a good summary of hyperparameter selection in scikit-learn.

1 Answer

Regarding 3 - Why are there 3 ways of cross validation in scikit-learn?

Let's look at this by analogy with clustering: scikit-learn implements multiple clustering algorithms.

Why? Isn't one of them simply better than the others?

You might answer: well, they are different algorithms, each with its own advantages and disadvantages. The same holds for the three cross-validation tools.

LogisticRegressionCV

LogisticRegressionCV implements logistic regression with built-in cross-validation support, to find the optimal C and l1_ratio parameters according to the scoring attribute.

LogisticRegressionCV is thus an "advanced" version of LogisticRegression, since it does not require the user to tune the hyperparameters C and l1_ratio themselves.
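As a minimal sketch of what "built-in" means here, the snippet below fits LogisticRegressionCV on the digits data and reads back the candidate grid and the selected value. The pixel scaling, `Cs=5` and `max_iter=1000` are choices made only for this illustration, and the helper variable `best_C` is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegressionCV

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values lie in 0..16; scaling helps the solver converge

# Cs=5 asks for 5 candidate values of C on a log scale between 1e-4 and 1e4
clf = LogisticRegressionCV(Cs=5, cv=5, scoring="f1_macro", max_iter=1000)
clf.fit(X, y)

print("candidate C values:", clf.Cs_)

# C_ may hold one entry per class depending on the version and settings;
# taking the first entry is a simplification for this sketch.
best_C = float(np.ravel(clf.C_)[0])
print("selected C:", best_C)
```

After `fit`, the object behaves like an ordinary fitted classifier, so `clf.predict` can be called directly without a separate refit step.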

GridSearchCV

The user guide states that:

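The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. For instance, the following param_grid specifies two grids to explore, one with a linear kernel and one with an RBF kernel that also varies gamma: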

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

Here you specify both the parameters to search over and the candidate values for each. Compared with LogisticRegressionCV, the main difference is that GridSearchCV can be used with any classifier or regressor. Most importantly, you can also use GridSearchCV with models that are not part of scikit-learn, as long as they follow the scikit-learn estimator interface: fit and predict, plus get_params and set_params so that each candidate can be cloned and configured.
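As a rough illustration of using GridSearchCV with a model that does not ship with scikit-learn, here is a sketch with a toy classifier. `NearestMeanClassifier` and its parameter `p` are hypothetical names invented for this example; inheriting from `BaseEstimator` is one convenient way to get `get_params` and `set_params`, not the only one.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV


class NearestMeanClassifier(ClassifierMixin, BaseEstimator):
    """Toy classifier (hypothetical): predict the class with the closest mean."""

    def __init__(self, p=2):
        self.p = p  # exponent used in the distance computation

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        # One mean vector per class
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Shape (n_samples, n_classes, n_features): |x - mean| for every pair
        diffs = np.abs(X[:, None, :] - self.means_[None, :, :])
        dists = (diffs ** self.p).sum(axis=2)
        return self.classes_[dists.argmin(axis=1)]


X, y = load_digits(return_X_y=True)

# The search treats the custom estimator like any built-in one
search = GridSearchCV(NearestMeanClassifier(), param_grid={"p": [1, 2, 3]},
                      scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```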

In addition to providing the best-performing model (refit on the full training data by default) after a call such as:

clf = GridSearchCV(clf, param_grid={"C": np.logspace(-4, 4, 10)},
                   scoring="f1_macro")
clf.fit(X_train, y_train)

GridSearchCV also stores detailed cross-validation results for every candidate, not just the best one:

cv_results_ : dict of numpy (masked) ndarrays. A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
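To make the layout of cv_results_ concrete, the sketch below runs a small search and prints one line per candidate. The grid, the pixel scaling and `max_iter=1000` are assumptions chosen only to keep the example short; the keys used (`params`, `mean_test_score`, `std_test_score`, `rank_test_score`) are standard entries of cv_results_.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values lie in 0..16; scaling helps the solver converge

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-2, 2, 5)},
    scoring="f1_macro",
    cv=5,
)
search.fit(X, y)

results = search.cv_results_
# Each key is a "column"; entry i of every column describes candidate i
for params, mean, std, rank in zip(results["params"],
                                   results["mean_test_score"],
                                   results["std_test_score"],
                                   results["rank_test_score"]):
    print(f"C={params['C']:<8.2f} f1_macro={mean:.3f} (+/- {std:.3f}) rank={rank}")
```

If pandas is installed, passing `search.cv_results_` to `pandas.DataFrame` gives the same information as a table, which is usually easier to sort and filter.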

cross_val_score

Sometimes you only want to estimate how well a single, fixed model generalizes, without searching over parameters. This is when you use cross_val_score: it fits the model on each training split and returns the score on the corresponding held-out fold.
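A minimal sketch of that use case, evaluating one fixed LogisticRegression on the digits data. The value `C=1.0`, the shuffled 5-fold splitter and the pixel scaling are assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values lie in 0..16; scaling helps the solver converge

clf = LogisticRegression(C=1.0, max_iter=1000)  # one fixed model, no search
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; clf itself is cloned internally and stays unfitted
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
print(f"f1_macro: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Note that cross_val_score returns only the scores. To get a usable model afterwards, `clf.fit` still has to be called explicitly.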

TLDR: These are different tools, and each is used for a different purpose. LogisticRegressionCV is only relevant for logistic regression. GridSearchCV is the most exhaustive and general variant; it provides both the evaluation scores and the best classifier. cross_val_score only evaluates, and is the preferred choice when evaluation is all you need.