I am working on a binary classification task on imbalanced data.
Since accuracy is not very meaningful in this case, I use scikit-learn to compute the precision-recall curve and the ROC curve in order to evaluate model performance.
But I found that both curves come out as a horizontal line when I use a Random Forest with many estimators; the same thing happens when I fit an SGD classifier.
The ROC chart is as follows:
And the Precision-Recall chart:
Since Random Forest is randomized, I don't get a horizontal line on every run; sometimes I get normal-looking ROC and PR curves. But the horizontal line is much more common.
Is this normal, or did I make some mistakes in my code?
Here is the snippet of my code:
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

classifier.fit(X_train, Y_train)
try:
    # Use decision_function when the classifier provides it (e.g. SGDClassifier)
    scores = classifier.decision_function(X_test)
except AttributeError:
    # Otherwise fall back to class-1 probabilities (e.g. RandomForestClassifier)
    scores = classifier.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(Y_test, scores, pos_label=1)
average_precision = average_precision_score(Y_test, scores)
plt.plot(recall, precision, label='area = %0.2f' % average_precision, color="green")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve')
plt.legend(loc="lower right")
plt.show()
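For reference, the ROC curve mentioned above is computed analogously with `roc_curve`. A minimal self-contained sketch, using small toy arrays as stand-ins for the real `Y_test` and `scores`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Toy stand-ins for Y_test and the classifier's scores (assumption)
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

fpr, tpr, _ = roc_curve(y_test, scores, pos_label=1)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='area = %0.2f' % roc_auc, color="blue")
plt.plot([0, 1], [0, 1], linestyle='--', color="gray")  # chance line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
```

A degenerate curve like the horizontal lines described above would show up here as `fpr`/`tpr` arrays with only two or three distinct points, so printing them is a quick sanity check.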
Is this normal?
No. Problems tackled with machine learning techniques are often much harder; perfect predictions are usually not possible.

Or did I make some mistakes in my code?

In your code? Probably not. In your testing? Maybe. We don't know. I would suggest trying cross-validation instead. Maybe your problem is very easy to learn, or maybe your test set is problematic. Cross-validation will show that. – cel
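A minimal sketch of the suggested cross-validation check, assuming a RandomForestClassifier and a synthetic imbalanced dataset as a stand-in for the real data (all names and parameters here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic imbalanced data: ~90% negatives, ~10% positives (assumption)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probability scores: every sample is scored by a model
# that never saw it during training, so a problematic single split
# cannot dominate the result
scores = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

print("AP  = %.3f" % average_precision_score(y, scores))
print("AUC = %.3f" % roc_auc_score(y, scores))
```

If the PR and ROC curves built from these out-of-fold scores still look like horizontal lines, the problem really is that easy (or the data leaks the label); if they look normal, the original train/test split was the issue.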