In scikit-learn, GridSearchCV() supports 'roc_auc' as a scoring function. It works well with n-fold cross-validation, but with LeaveOneOut it fails with the following error message:

ValueError: Only one class present in Y. ROC AUC score is not defined in that case.

Although it seems natural that an AUC cannot be computed from a single sample, other languages such as R support ROC AUC with leave-one-out.

How can I calculate it with Python and scikit-learn? If it is impossible, would cross-validation with a large number of folds give a similar result?

Are you trying to plot a multiclass ROC curve for a single-class model? Have you read this? - pazitos10
The problem with leave-one-out cross-validation is that GridSearchCV calculates the score over each fold and then reports the average. With leave-one-out, it is impossible to generate a score for an individual sample. - David Maust
Thank you for your answers. So GridSearchCV() cannot be used with LeaveOneOut. Is there any other way, instead of GridSearchCV, to calculate the roc_auc score over all samples while varying the parameters? - z991

1 Answer

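As pointed out by David Maust, the problem with leave-one-out cross-validation is that GridSearchCV calculates the score on each fold and then reports the average. With LeaveOneOut, each test fold contains a single observation and therefore a single class, so ROC AUC is undefined on it.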
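
To make the per-fold failure concrete, here is a minimal pure-Python sketch. It relies on the standard interpretation of ROC AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; this is not how scikit-learn computes the score internally.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        # With only one class there is no pair to rank, so AUC is undefined.
        raise ValueError("AUC is undefined: need both a positive and a negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A leave-one-out test fold holds one observation: no pair exists.
try:
    pairwise_auc([1], [0.8])
    single_fold_ok = True
except ValueError:
    single_fold_ok = False

# Pooling the held-out scores from all folds supplies both classes.
pooled = pairwise_auc([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6])
print(single_fold_ok, pooled)
```

Averaging per-fold scores cannot work here because no single fold ever has a pair to rank, whereas the pooled set of held-out scores does.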

In order to obtain a meaningful ROC AUC with LeaveOneOut, you need to calculate probability estimates for each fold (each consisting of just one observation), then calculate the ROC AUC on the set of all these probability estimates.

This can be done as follows:

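from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid

def LeaveOneOut_predict_proba(clf, X, y, i):
    # Fit on everything except row i, then return P(class 1) for row i.
    # Assumes X and y are pandas objects sharing a unique index.
    clf.fit(X.drop(i), y.drop(i))
    return clf.predict_proba(X.loc[[i]])[0, 1]

# set clf, param_grid, X, y

for params in ParameterGrid(param_grid):
    print(params)
    clf.set_params(**params)
    # One held-out probability per sample, pooled across all folds
    y_proba = [LeaveOneOut_predict_proba(clf, X, y, i) for i in X.index]
    print(roc_auc_score(y, y_proba))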

Sample output:

{'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}
0.6057986111111112
{'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
0.620625
{'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
0.5862499999999999

Since this does not use the infrastructure of GridSearchCV, you will need to implement picking the maximal score and parallelization (if necessary) yourself.
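
One possible way to handle both of those is sketched below. It swaps the hand-written loop for scikit-learn's `cross_val_predict` with `cv=LeaveOneOut()` and `method="predict_proba"`, which collects one held-out probability per sample and accepts an `n_jobs` argument for parallel fitting. The synthetic dataset, the parameter grid, and the helper name `loo_auc` are assumptions made for illustration, not part of the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, ParameterGrid, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data, for illustration only (not the asker's dataset).
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# Hypothetical grid, loosely mirroring the parameters in the sample output.
param_grid = {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]}


def loo_auc(clf, X, y, n_jobs=None):
    """Pooled ROC AUC: one held-out probability per sample, scored once."""
    proba = cross_val_predict(
        clf, X, y, cv=LeaveOneOut(), method="predict_proba", n_jobs=n_jobs
    )
    # Column 1 holds the estimated probability of the positive class.
    return roc_auc_score(y, proba[:, 1])


results = []
for params in ParameterGrid(param_grid):
    clf = KNeighborsClassifier(**params)
    results.append((loo_auc(clf, X, y), params))

# Pick the parameter setting with the highest pooled AUC.
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```

Passing `n_jobs=-1` to `loo_auc` would spread the per-fold fits across available cores. Note that the pooled AUC is used here for model selection only, so the winning score is an optimistic estimate of performance on new data.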