I've used sci-kit learn to build a random forest model to predict insurance renewals. This is tricky because, in my data set, 96.24% renew while only 3.76% do not renew. After I ran the model I evaluated model performance with a confusion matrix, classification report, and ROC curve.
[[ 2448 8439]
[ 3 278953]]
precision recall f1-score support
0 1.00 0.22 0.37 10887
1 0.97 1.00 0.99 278956
avg / total 0.97 0.97 0.96 289843
My ROC curve looks like this:
The model predicted renewals at just a hair under 100% (rounded to 1.00, see recall column) and non-renewals at about 22% (see recall column). The ROC curve would suggest an area under the curve of much greater than what is indicated in the bottom-right portion of the plot (area = 0.61).
Does anyone understand why this is happening?
Thank you!