
I've used scikit-learn to build a random forest model to predict insurance renewals. This is tricky because my data set is highly imbalanced: 96.24% of customers renew while only 3.76% do not. After fitting the model, I evaluated its performance with a confusion matrix, a classification report, and a ROC curve.

[[  2448   8439]
 [     3 278953]]


             precision    recall  f1-score   support

          0       1.00      0.22      0.37     10887
          1       0.97      1.00      0.99    278956

avg / total       0.97      0.97      0.96    289843
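Output like the above can be produced with a sketch along these lines. The insurance data isn't shown in the question, so a synthetic imbalanced data set (via `make_classification`) stands in for it, and all variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the insurance data: ~96% of class 1 ("renew"),
# ~4% of class 0 ("did not renew").
X, y = make_classification(n_samples=20000, weights=[0.04, 0.96], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```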

My ROC curve looks like this:

[ROC curve plot; the legend in the bottom-right reports area = 0.61]

The model's recall on renewals is just a hair under 100% (rounded to 1.00 in the recall column), and its recall on non-renewals is about 22%. By inspection, the ROC curve suggests an area under the curve much greater than the 0.61 reported in the bottom-right of the plot.

Does anyone understand why this is happening?

Thank you!

This question is more suitable for stats.stackexchange.com. I agree that one can see, by inspection, that the area under the curve must be greater than 0.61, so I don't know where that number is coming from. However, perhaps the smooth curve is not an accurate representation of the actual ROC -- perhaps the actual ROC is not a smooth curve but some lumpy curve, such that its area really is 0.61 after all. My advice is to get the scores and actual labels, construct the ROC yourself, and compare. – Robert Dodier
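The commenter's advice (build the ROC yourself from the scores and labels) can be sketched as below, on toy data. One plausible cause of the discrepancy, shown here as an assumption rather than a diagnosis: passing hard 0/1 predictions to `roc_auc_score` instead of probabilities collapses the curve to a single operating point, and the reported "AUC" becomes (TPR + TNR) / 2. For recalls of roughly 1.00 and 0.22, that works out to about 0.61:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and scores; in practice use y_test and
# clf.predict_proba(X_test)[:, 1].
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 1])
scores = np.array([0.9, 0.8, 0.85, 0.7, 0.3, 0.95, 0.6, 0.4])

# ROC built from continuous scores: many thresholds, a real curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC from probability scores:", roc_auc_score(y_true, scores))

# ROC built from hard 0/1 predictions: a single operating point, so the
# "AUC" degenerates to (TPR + TNR) / 2 -- typically much lower.
hard = (scores >= 0.5).astype(int)
print("AUC from hard labels:", roc_auc_score(y_true, hard))
```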

1 Answer


In cases where the classes are highly imbalanced, ROC AUC turns out to be a misleading metric. A better choice is average precision, i.e. the area under the precision-recall (PR) curve.
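A minimal sketch of computing these PR-based metrics with scikit-learn, on toy data. Note that for the renewal problem the rare class ("did not renew", label 0 in the question) should be treated as the positive class, so you would score the probability of that class:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy example: 1 marks the rare class of interest; scores are the model's
# predicted probability of that class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1])
scores = np.array([0.2, 0.6, 0.9, 0.7, 0.3, 0.8, 0.4, 0.95])

# Full precision-recall curve across thresholds.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Average precision summarizes the PR curve in a single number.
ap = average_precision_score(y_true, scores)
print("Average precision:", ap)
```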

This supporting Kaggle link talks about the exact same issue in a similar problem setting.

This answer and the linked paper explain that optimizing for the best area under the PR curve will also yield the best ROC curve.