1 vote

I am working on a binary classification task on imbalanced data.

Since accuracy is not very meaningful in this case, I use scikit-learn to compute the Precision-Recall curve and the ROC curve in order to evaluate the model's performance.

But I found that both curves become a horizontal line when I use a Random Forest with many estimators; it also happens when I fit an SGD classifier.

The ROC chart is as follows:

[ROC curve image]

And the Precision-Recall chart:

[Precision-Recall curve image]

Since Random Forest is stochastic, I don't get a horizontal line on every run; sometimes I also get a regular ROC and PR curve. But the horizontal line is much more common.

Is this normal? Or did I make some mistake in my code?

Here is the snippet of my code:

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

classifier.fit(X_train, Y_train)

# Use decision_function scores when available; otherwise fall back to
# the predicted probability of the positive class.
try:
    scores = classifier.decision_function(X_test)
except AttributeError:
    scores = classifier.predict_proba(X_test)[:, 1]

precision, recall, _ = precision_recall_curve(Y_test, scores, pos_label=1)
average_precision = average_precision_score(Y_test, scores)

plt.plot(recall, precision, label='area = %0.2f' % average_precision, color="green")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision Recall Curve')
plt.legend(loc="lower right")
plt.show()
It looks a bit too good to be true. :-) Could you please upload your sample data file via a Dropbox share link or Google Drive? – Jianxun Li
Take the time and think about what the plots actually tell you: you basically performed perfect predictions on the test set. "Is this normal?" No. Problems tackled with machine learning techniques are often much harder, and perfect predictions are usually not possible. "Or did I make some mistakes in my code?" In your code, probably not; in your testing, maybe. We don't know. I would suggest trying cross-validation instead. Maybe your problem is very easy to learn, or maybe your test set is problematic; a cross-validation will show that. – cel
Thank you guys, that really helps! I will try cross-validation, and I will upload the data if I still can't get regular curves. – Jim GB
@cel: It was indeed a problem with the selection of the test data; I happened to choose an easy test set, which is why I got the horizontal line. Thank you! – Jim GB

3 Answers

3 votes

Yes, this can happen. If your scores perfectly separate the data into two piles, the true-positive rate goes vertically from 0 to 1 without any false positives (the vertical line) as your threshold passes over the pile of positives, and then the false-positive rate goes from 0 to 1 (the horizontal line) as your threshold passes over the pile of negatives.
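
To see the right angle concretely, here is a minimal sketch with made-up, perfectly separated scores (synthetic data, not the asker's); scikit-learn's roc_curve traces exactly the shape described above:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Perfectly separated piles: every positive scores above every negative.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

fpr, tpr, _ = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))            # (0,0) -> (0,1) -> (1,1): a right angle
print(roc_auc_score(y_true, scores))  # 1.0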

If you can get the same ROC curve on a held-out test set, you are golden. If you can get the same ROC curve on each of the five test folds of a 5-fold cross-validation, you are platinum.
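
A minimal sketch of that cross-validation check, assuming the classifier from the question (it needs decision_function or predict_proba) and hypothetical X, Y holding the full dataset before splitting; StratifiedKFold keeps the class ratio in every fold, which matters on imbalanced data:

from sklearn.model_selection import cross_val_score, StratifiedKFold

# Average precision on 5 stratified folds; five similar scores suggest the
# result is not an artifact of one lucky train/test split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(classifier, X, Y, cv=cv, scoring='average_precision')
print(fold_scores, fold_scores.mean())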

2 votes

Along with the other answers, it's possible that you have duplicated your label as a feature in the dataset. When Random Forest subsamples features, it doesn't always draw that column as a predictor, so you sometimes get a "normal-looking" ROC curve (the remaining features can't predict the label exactly); when the duplicated label/feature does land in the sample, your model is 100% accurate by definition.
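
Here is a small sketch of that failure mode on synthetic data (make_classification stands in for the asker's dataset): once the label is copied into the feature matrix, the forest scores a near-perfect AUC whenever its feature subsampling picks up the leaked column:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, then leak the label in as an extra feature.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_leaky = np.column_stack([X, y])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))  # ~1.0 by construction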

SGD can have the same issue, in a situation where plain linear regression would simply fail: with linear regression, the duplicated column gives you a singular/near-singular matrix and the estimation fails outright. With SGD, since you re-estimate based on each arriving point, the math doesn't fail (though your model will still be suspect).
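
To illustrate with toy data (hypothetical, not from the question): a duplicated column makes X'X rank-deficient, which would break a closed-form normal-equations solve, yet SGDClassifier still iterates to a (suspect) fit:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0]])      # duplicate the first column
y = (X[:, 0] > 0).astype(int)

print(np.linalg.matrix_rank(X.T @ X))  # 3, not 4: X'X is singular
clf = SGDClassifier(random_state=0).fit(X, y)  # no error; SGD never inverts X'X
print(clf.score(X, y))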

0 votes

The other two answers give only sufficient conditions for seeing a horizontal line (i.e. they are possible causes of a horizontal line, but not the only possibilities). Here is a necessary and sufficient condition:

If you see a horizontal line in a PR curve, it must be at the top (precision = 1), and it means the examples in that threshold range are all TPs. And the longer the line, the more TPs (because a longer line covers a larger recall range).

Proof:

Let TP denote the number of true positives and PP the number of predicted positives, so precision = TP/PP.

A horizontal line means recall increases by some amount while precision stays unchanged. Let's discuss these two conditions separately:

  1. Recall increases by some amount ->
  • TP increases by some amount.
  • Suppose TP increases by the smallest amount, 1, and let x be the corresponding increase in PP. Every new TP is also a new PP, so by definition x >= 1; and since the scores are distinct (no ties), each threshold step adds exactly one predicted positive, so along the segment x = 1.
  2. Precision unchanged ->
  • (TP+1)/(PP+x) = TP/PP. Cross-multiplying gives PP*TP + PP = TP*PP + TP*x, so x = PP/TP. With x = 1, this forces PP = TP.

This means both TP and PP increase by 1 at each step, i.e. only positive examples are added; and since PP = TP, the precision TP/PP = 1 as well. QED.
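
As a quick numerical check (tie-free synthetic scores, matching the distinct-scores assumption above): scanning consecutive points of precision_recall_curve, every segment where recall changes while precision stays flat sits at precision 1.

import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
s = rng.normal(size=200) + 2 * y       # continuous scores, no ties

precision, recall, _ = precision_recall_curve(y, s)
for i in range(len(recall) - 1):
    if recall[i] != recall[i + 1] and precision[i] == precision[i + 1]:
        assert precision[i] == 1.0     # horizontal segments occur only at the top
print('check passed')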