0 votes

I'm plotting ROC curves for several classifiers and am stumped to find that the random forest classifier outputs a perfect ROC curve (see below), even though I'm only getting an accuracy of 85% for class 0 and 41% for class 1 (class 1 is the positive class).

[image: ROC curves for the three classifiers]

The actual y values are y=[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1.]

and the predicted y values=[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1.]

The predicted probabilities are=[ 0. 0.2 0. 0. 0. 0.2 0.1 0.4 0.2 0.9 0.9 0.4 0.9 0.2 0. 0. 0. 0. 0. 0. 0.1 0. 0.6 0. 0. 0.1 0. 0.1 0.7 0. 0. 0.1 0. 0.8 0.5 0.8 0. 1. 0.2 0. 0.9 0.9 0. 0. 0. 0.7 0.4 0. 0. 0.2 0. 0. 0. 0.6 0.1 0. 0. 0.1 0.2 0. 0. 0.1 0. 0.1 0.1 0. 0.1 0. 0. 0.1 0. 0. 1. 0. 0. 0. 0.4 0. 0. 0. 0. 1. 0.9 0.9 1. 0.9 1. 0.3 0.9 0.7 0.5 0.8 1. 0.9 0.9 1. 0.7 0.9 0. 0.8 0.2 0.2 0.8 0.9 0.3 0.7 0.3 0.1 0.1 0. 0.5 0.7 0. 0.2 0.1 0.7 0. 0.4 0.9 0.2 1. 0.8 0.1 0.1 0.1 0.3 1. 0.2 0.4 0.8 0.8 0.4 0.8 1. 0.9 0.9 0.8 0.7 1. 1. 0.2 0.7 0. 0.8 0.7 0.2 0.7 0.2 0.8 0.9 0.3 0.3 1. 1. 0.2 0.7 1. 0.3 0.2 0.2 0.1 0.8 0.8 0.9 0.9 1. 0.7 0. 0. 0.7 0.4 0.1 0.2 0.7 0.9 1. 1. 0.6 0.9 0.8 0.9 0.8 0.7 0.3 0. 0.2 1. 0.9 0. 0.1 0.6 0.8 0.1 0.1 0. 0.7 0.1 0.4 0. 0.2 0.6 0.1 0. 0.7 1. ]

Finally, the code for creating the ROC curves is:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import auc
import matplotlib.pyplot as plt

#Lasso (L1-penalized logistic regression; best_C comes from earlier tuning)
final_logit = LogisticRegression(class_weight='balanced', penalty='l1', C=best_C,
                                 solver='liblinear')  # 'l1' requires liblinear
final_logit.fit(x, y)
y_pred_lass=final_logit.predict_proba(x)

fpr_lass, tpr_lass, thresholds = metrics.roc_curve(y, y_pred_lass[:,1], pos_label=1)
roc_auc_lass = auc(fpr_lass, tpr_lass)


#Logistic
wlogit = LogisticRegression(class_weight='balanced')
wlogit.fit(x,y)
y_pred_logit=wlogit.predict_proba(x)


fpr_logit, tpr_logit, thresholds = metrics.roc_curve(y, y_pred_logit[:,1], pos_label=1)
roc_auc_logit = auc(fpr_logit, tpr_logit)

#Random forest (the class is RandomForestClassifier, not RandomForest)
brf = RandomForestClassifier(class_weight='balanced')
brf.fit(x, y)
y_pred_brf=brf.predict_proba(x)


fpr_brf, tpr_brf, thresholds = metrics.roc_curve(y, y_pred_brf[:,1], pos_label=1)
roc_auc_brf = auc(fpr_brf, tpr_brf)


plt.figure(figsize=(7,7))
lw = 2
plt.plot(fpr_lass, tpr_lass, color='darkorange',
     lw=lw, label='Lasso (area = %0.2f)' % roc_auc_lass)
plt.plot(fpr_logit, tpr_logit, color='red',
     lw=lw, label='Logistic (area = %0.2f)' % roc_auc_logit)
plt.plot(fpr_brf, tpr_brf, color='green',
     lw=lw, label='Random Forest (area = %0.2f)' % roc_auc_brf)
plt.plot([0, 1], [0, 1], color='black', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
How are you calculating the accuracy? See stackoverflow.com/questions/39145083/… for the same question. – Calimo

I'm using the built-in score function, which calculates mean accuracy. Re: your response on the other post, my understanding of the ROC curve is that it is a graph of a classifier's performance as its threshold is varied. Is it true that the ROC curve will be perfect as long as there exists one threshold value with perfect classification? – yogz123

Yes, you are right. – PhilippPro
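The point in the last two comments can be checked directly: AUC depends only on how the probabilities rank the classes, so a classifier that ranks every positive above every negative gets AUC = 1.0 even when the default 0.5 threshold misclassifies half the samples. A minimal sketch with made-up numbers (not the question's data):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1]                            # two negatives, two positives
probs = [0.1, 0.2, 0.3, 0.4]                     # every positive scores above every negative
preds = [1 if p >= 0.5 else 0 for p in probs]    # the 0.5 threshold labels everything 0

acc = accuracy_score(y_true, preds)              # 0.5: half the labels are wrong
auc_score = roc_auc_score(y_true, probs)         # 1.0: a threshold of e.g. 0.25 separates perfectly
```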

1 Answer

0 votes

You're getting a near-perfect ROC-AUC score because you're calculating it on the training set, which the random forest has essentially memorized. You need to do the evaluation on a held-out test set.
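A minimal sketch of what that looks like, using a synthetic stand-in for the question's x and y (the dataset, split ratio, and random seeds here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the question's x and y
x, y = make_classification(n_samples=204, weights=[0.6, 0.4], random_state=0)

# Hold out a test set the model never sees during fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)

brf = RandomForestClassifier(class_weight='balanced', random_state=0)
brf.fit(x_train, y_train)

train_auc = roc_auc_score(y_train, brf.predict_proba(x_train)[:, 1])
test_auc = roc_auc_score(y_test, brf.predict_proba(x_test)[:, 1])
# train_auc will typically be near 1.0; test_auc is the honest estimate
```

The same split should be applied to the logistic regression models, so that all three ROC curves in the plot are computed on the same held-out data.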