
I am working with an imbalanced dataset. After splitting the dataset into training and test sets, I applied the SMOTE algorithm to balance the training set before applying any ML models. I now want to run cross-validation and plot the ROC curve of each fold, showing the AUC of each fold as well as the mean AUC in the plot. I named the resampled training set variables X_train_res and y_train_res, and the following is the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

cv = StratifiedKFold(n_splits=10)
classifier = SVC(kernel='sigmoid', probability=True, random_state=0)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
plt.figure(figsize=(10,10))
i = 0
for train, test in cv.split(X_train_res, y_train_res):
    # Fit on the training fold, then predict probabilities on the test fold
    probas_ = classifier.fit(X_train_res[train], y_train_res[train]).predict_proba(X_train_res[test])
    # Compute the ROC curve and the area under it
    fpr, tpr, thresholds = roc_curve(y_train_res[test], probas_[:, 1])
    # Interpolate onto a common FPR grid so the folds can be averaged
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
         label='Chance', alpha=.8)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)

std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                 label=r'$\pm$ 1 std. dev.')

plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Cross-Validation ROC of SVM', fontsize=18)
plt.legend(loc="lower right", prop={'size': 15})
plt.show()

The following is the output:

[plot: per-fold ROC curves with their AUCs, the mean ROC curve, and the ±1 std. dev. band]

Please tell me whether this code is correct for plotting the cross-validation ROC curves or not.


1 Answer


From your comment: "The problem is that I do not clearly understand cross-validation. In the for loop, I have passed the training sets of the X and y variables. Does cross-validation work like this?"

Leaving aside SMOTE and the imbalance issue, which do not appear in the code you have posted, your procedure looks correct.

In more detail, for each one of your n_splits=10:

  • you create train and test folds

  • you fit the model using the train fold:

      classifier.fit(X_train_res[train], y_train_res[train])
    
  • and then you predict probabilities using the test fold:

      predict_proba(X_train_res[test])
    

This is exactly the idea behind cross-validation.

So, since you have n_splits=10, you get 10 ROC curves and their respective AUC values (and their average), exactly as expected.
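If the mechanics of the loop are still unclear: cv.split does not return data, it yields index arrays into whatever you pass in, and the stratification keeps the class ratio in every fold. Here is a tiny self-contained illustration with made-up toy data (the arrays are purely illustrative):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 9 samples with an imbalanced 6:3 label split
X = np.arange(18).reshape(9, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

cv = StratifiedKFold(n_splits=3)
for fold, (train, test) in enumerate(cv.split(X, y)):
    # split() yields index arrays; each test fold keeps the 2:1 class ratio
    print('fold %d: train=%s test=%s' % (fold, train, test))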

However:

The need for (SMOTE) upsampling due to the class imbalance changes the correct procedure and makes your overall process incorrect: you should not upsample your initial dataset; instead, you need to incorporate the upsampling procedure into the CV process itself.

So, the correct procedure here for each one of your n_splits becomes the following (notice that starting with a stratified CV split, as you have done, is essential in class-imbalance cases; a code sketch follows the list):

  • create train and test folds
  • upsample your train fold with SMOTE
  • fit the model using the upsampled train fold
  • predict probabilities using the test fold (not upsampled)
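Putting these steps together, here is a minimal sketch of what the corrected loop could look like. It assumes the imbalanced-learn package (imblearn) is installed, a recent version where the sampler method is fit_resample (older releases called it fit_sample), and that X_train and y_train are your original, un-resampled training arrays (these names are illustrative):

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

cv = StratifiedKFold(n_splits=10)
classifier = SVC(kernel='sigmoid', probability=True, random_state=0)
sm = SMOTE(random_state=0)

tprs, aucs = [], []
mean_fpr = np.linspace(0, 1, 100)

for i, (train, test) in enumerate(cv.split(X_train, y_train)):
    # Upsample ONLY the training fold; the test fold must stay untouched
    X_res, y_res = sm.fit_resample(X_train[train], y_train[train])
    # Fit on the upsampled fold, predict on the original (imbalanced) test fold
    probas_ = classifier.fit(X_res, y_res).predict_proba(X_train[test])
    fpr, tpr, _ = roc_curve(y_train[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    aucs.append(auc(fpr, tpr))

The plotting code from your question can then be reused as-is on tprs and aucs.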

For details regarding the rationale, please see my own answer in the Data Science SE thread Why you shouldn't upsample before cross validation.
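As a side note, if you only need the per-fold AUC scores rather than the curves themselves, imblearn also provides a Pipeline that applies the sampler during fitting only, so a plain cross_val_score stays leakage-free. A rough sketch, with the same assumed variable names as above:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The SMOTE step runs only when the pipeline is fitted on a training fold,
# never when scoring on a test fold
pipe = Pipeline([('smote', SMOTE(random_state=0)),
                 ('svc', SVC(kernel='sigmoid', probability=True, random_state=0))])
scores = cross_val_score(pipe, X_train, y_train,
                         cv=StratifiedKFold(n_splits=10), scoring='roc_auc')
print('mean AUC: %0.2f +/- %0.2f' % (scores.mean(), scores.std()))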