0 votes

I'm plotting ROC curves for several classifiers and am stumped to find that the random forest classifier outputs a perfect ROC curve (see below), even though I'm only getting an accuracy of 85% for class 0 and 41% for class 1 (class 1 is the positive class).

[image: ROC curves for the three classifiers]

The actual y values are y=[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1.]

and the predicted y values=[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1.]

The predicted probabilities are=[ 0. 0.2 0. 0. 0. 0.2 0.1 0.4 0.2 0.9 0.9 0.4 0.9 0.2 0. 0. 0. 0. 0. 0. 0.1 0. 0.6 0. 0. 0.1 0. 0.1 0.7 0. 0. 0.1 0. 0.8 0.5 0.8 0. 1. 0.2 0. 0.9 0.9 0. 0. 0. 0.7 0.4 0. 0. 0.2 0. 0. 0. 0.6 0.1 0. 0. 0.1 0.2 0. 0. 0.1 0. 0.1 0.1 0. 0.1 0. 0. 0.1 0. 0. 1. 0. 0. 0. 0.4 0. 0. 0. 0. 1. 0.9 0.9 1. 0.9 1. 0.3 0.9 0.7 0.5 0.8 1. 0.9 0.9 1. 0.7 0.9 0. 0.8 0.2 0.2 0.8 0.9 0.3 0.7 0.3 0.1 0.1 0. 0.5 0.7 0. 0.2 0.1 0.7 0. 0.4 0.9 0.2 1. 0.8 0.1 0.1 0.1 0.3 1. 0.2 0.4 0.8 0.8 0.4 0.8 1. 0.9 0.9 0.8 0.7 1. 1. 0.2 0.7 0. 0.8 0.7 0.2 0.7 0.2 0.8 0.9 0.3 0.3 1. 1. 0.2 0.7 1. 0.3 0.2 0.2 0.1 0.8 0.8 0.9 0.9 1. 0.7 0. 0. 0.7 0.4 0.1 0.2 0.7 0.9 1. 1. 0.6 0.9 0.8 0.9 0.8 0.7 0.3 0. 0.2 1. 0.9 0. 0.1 0.6 0.8 0.1 0.1 0. 0.7 0.1 0.4 0. 0.2 0.6 0.1 0. 0.7 1. ]

Finally, the code for creating the ROC curves is:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import auc
import matplotlib.pyplot as plt

#Lasso (L1-penalized logistic regression; best_C comes from earlier tuning)
final_logit = LogisticRegression(class_weight='balanced', penalty='l1', C=best_C,
                                 solver='liblinear')  # 'l1' requires liblinear
final_logit.fit(x, y)
y_pred_lass=final_logit.predict_proba(x)

fpr_lass, tpr_lass, thresholds = metrics.roc_curve(y, y_pred_lass[:,1], pos_label=1)
roc_auc_lass = auc(fpr_lass, tpr_lass)


#Logistic
wlogit = LogisticRegression(class_weight='balanced')
wlogit.fit(x,y)
y_pred_logit=wlogit.predict_proba(x)


fpr_logit, tpr_logit, thresholds = metrics.roc_curve(y, y_pred_logit[:,1], pos_label=1)
roc_auc_logit = auc(fpr_logit, tpr_logit)

#Random forest (the class is RandomForestClassifier, not RandomForest)
brf = RandomForestClassifier(class_weight='balanced')
brf.fit(x, y)
y_pred_brf=brf.predict_proba(x)


fpr_brf, tpr_brf, thresholds = metrics.roc_curve(y, y_pred_brf[:,1], pos_label=1)
roc_auc_brf = auc(fpr_brf, tpr_brf)


plt.figure(figsize=(7,7))
lw = 2
plt.plot(fpr_lass, tpr_lass, color='darkorange',
     lw=lw, label='Lasso (area = %0.2f)' % roc_auc_lass)
plt.plot(fpr_logit, tpr_logit, color='red',
     lw=lw, label='Logistic (area = %0.2f)' % roc_auc_logit)
plt.plot(fpr_brf, tpr_brf, color='green',
     lw=lw, label='Random Forest (area = %0.2f)' % roc_auc_brf)
plt.plot([0, 1], [0, 1], color='black', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
How are you calculating the accuracy? See stackoverflow.com/questions/39145083/… for the same question. – Calimo

I'm using the built-in score function, which calculates mean accuracy. Re: your response on the other post, my understanding of the ROC curve is that it is a graph of a classifier's performance as its threshold is varied. Is it true that the ROC curve will be perfect as long as there exists one threshold value with perfect classification? – yogz123

Yes, you are right. – PhilippPro
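The point in the last two comments can be checked directly: AUC depends only on how the probabilities rank the classes, so a classifier that ranks every positive above every negative gets AUC = 1.0 even when the default 0.5 threshold misclassifies half the samples. A minimal sketch with made-up numbers (not the question's data):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]                            # two negatives, two positives
probs = [0.1, 0.2, 0.3, 0.4]                     # every positive scores above every negative
preds = [1 if p >= 0.5 else 0 for p in probs]    # the 0.5 threshold labels everything 0

acc = accuracy_score(y_true, preds)              # 0.5: half the labels are wrong
auc_score = roc_auc_score(y_true, probs)         # 1.0: a threshold of e.g. 0.25 separates perfectly
```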

1 Answer

0 votes

You're getting a near-perfect ROC-AUC score because you're calculating it on the training set, which the random forest has essentially memorized. You need to do the evaluation on a held-out test set.
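A minimal sketch of what that looks like, using a synthetic stand-in for the question's x and y (the dataset, split ratio, and random seeds here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the question's x and y
x, y = make_classification(n_samples=204, weights=[0.6, 0.4], random_state=0)

# Hold out a test set the model never sees during fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)

brf = RandomForestClassifier(class_weight='balanced', random_state=0)
brf.fit(x_train, y_train)

train_auc = roc_auc_score(y_train, brf.predict_proba(x_train)[:, 1])
test_auc = roc_auc_score(y_test, brf.predict_proba(x_test)[:, 1])
# train_auc will typically be near 1.0; test_auc is the honest estimate
```

The same split should be applied to the logistic regression models, so that all three ROC curves in the plot are computed on the same held-out data.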