1 vote

I am totally new to machine learning and I'm trying to use scikit-learn to make a simple logistic regression model with 1 input variable (X) and a binary outcome (Y). My data consists of 325 samples, with 39 successes and 286 failures. The data was split into a training set (70%) and a test set (30%).

My goal is actually to obtain the predicted probabilities of success for any given X based on my data, not for classification prediction per se. That is, I will be taking the predicted probabilities for use in a separate model I'm building and won't be using the logistic regression as a classifier at all. So it's important that the predicted probabilities actually fit the data.
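
For context, this is roughly how I am fitting the model and getting the probabilities (a minimal sketch; the variable names are mine and all settings are the scikit-learn defaults):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # X has shape (325, 1); Y is 0/1 with 39 ones and 286 zeros
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.30, random_state=0)

    model = LogisticRegression()
    model.fit(X_train, Y_train)

    predicted = model.predict(X_test)    # hard 0/1 predictions
    probs = model.predict_proba(X_test)  # probs[:, 1] = P(Y = 1 | X)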

However, I am having some trouble understanding whether or not my model is a good fit to the data, or if the computed probabilities are actually accurate.

I am getting the following metrics:

  • Classification accuracy: metrics.accuracy_score(Y_test, predicted) = 0.92. My understanding of this metric is that the model has a high chance of making correct predictions, so it looks to me like the model is a good fit.

  • Log loss: cross_val_score(LogisticRegression(), X, Y, scoring='neg_log_loss', cv=10) = -0.26. This is probably the most confusing metric for me, and apparently the most important, as it measures the accuracy of the predicted probabilities. I know that the closer to zero the score is, the better, but how close is close enough?

  • AUC: metrics.roc_auc_score(Y_test, probs[:, 1]) = 0.9. Again, this looks good, since the closer the ROC score is to 1 the better.

  • Confusion Matrix: metrics.confusion_matrix(Y_test, predicted) =

        [[88,  0],
         [ 8,  2]]
    

    My understanding here is that the diagonal gives the numbers of correct predictions on the test set, so this looks ok.

  • Report: metrics.classification_report(Y_test, predicted) =

                     precision    recall  f1-score   support

                0.0       0.92      1.00      0.96        88
                1.0       1.00      0.20      0.33        10

        avg / total       0.93      0.92      0.89        98
    

    According to this classification report, the model has good precision, so it looks like a good fit. However, I am not sure how to interpret the recall, or whether this report is bad news for my model. The sklearn documentation states that recall is a model's ability to find all the positive samples, so a score of 0.2 for class 1 would mean that it only finds the positives 20% of the time? That sounds like a really bad fit to the data.

I'd really appreciate it if someone could clarify whether I am interpreting these metrics the right way, and perhaps shed some light on whether my model is good or bogus. Also, if there are any other tests I could do to determine whether the computed probabilities are accurate, please let me know.
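
One check I was considering for the probabilities themselves is a calibration curve together with the Brier score (a minimal sketch below using sklearn.calibration.calibration_curve; with only about 10 positives in the test set, I assume the bins will be quite noisy):

    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss

    # probs[:, 1] are the predicted probabilities of success on the test set
    frac_pos, mean_pred = calibration_curve(Y_test, probs[:, 1], n_bins=5)

    # For well-calibrated probabilities, the observed frequency of success in
    # each bin should be close to the mean predicted probability in that bin
    for mp, fp in zip(mean_pred, frac_pos):
        print("mean predicted %.2f -> observed frequency %.2f" % (mp, fp))

    # Brier score: mean squared error of the probabilities (0 is perfect)
    print("Brier score:", brier_score_loss(Y_test, probs[:, 1]))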

If these aren't good metric scores, I'd really appreciate some direction on where to go next in terms of improvement.

Thanks!!


2 Answers

5 votes

Your data set is unbalanced, since there are far more failures than successes. A classifier that just guesses failure all the time would get about 90% accuracy on your test set (88 of the 98 samples are failures), so 92% accuracy isn't that impressive.

The confusion matrix shows what's happening: 88 times it correctly predicts failure, and 8 times it incorrectly predicts failure when the true outcome was success. Only twice does it predict success correctly.

Accuracy is the fraction of guesses that are correct: (88 + 2)/98 ≈ 0.92, or 92% overall. The recall for success is only 2 out of the (8 + 2) = 10 actual successes, or 20%.

So the model isn't a great fit. There are many ways to deal with unbalanced data sets, such as weighting the examples or applying a prior to the predictions. The confusion matrix is a good way to see what's really happening.
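
For example, a minimal sketch of the weighting approach in scikit-learn (assuming the same X_train/Y_train split as in the question):

    from sklearn.linear_model import LogisticRegression

    # class_weight='balanced' weights each class inversely to its frequency,
    # so the rare successes count as much in the fit as the many failures
    weighted = LogisticRegression(class_weight='balanced')
    weighted.fit(X_train, Y_train)

    probs_weighted = weighted.predict_proba(X_test)[:, 1]

Note that reweighting effectively changes the prior, so the predicted probabilities for the rare class get pushed upward; since you want the probabilities themselves, you may need to adjust them back toward the true base rate afterwards.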

2 votes

Your data suffers from a class imbalance problem, and you have not specified any way of dealing with it while training your classifier. Even though your accuracy is high, that may simply be because the failure class is so large that it dominates your test set as well.

To deal with it, you can use a stratified split in sklearn to shuffle and split your data so that both the training and test sets keep the same class proportions.
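
For instance, a minimal sketch (assuming the same X and Y as in the question; train_test_split takes a stratify argument, and StratifiedKFold does the same for cross-validation):

    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Hold-out split that keeps the ~12% success rate in both halves
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.30, stratify=Y, random_state=0)

    # Stratified 10-fold cross-validation: each fold keeps the class proportions
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, Y):
        pass  # fit on X[train_idx], evaluate on X[test_idx]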

You can also try other techniques to improve your classifier, such as a grid search over hyperparameters with GridSearchCV. The scikit-learn documentation on model evaluation is worth reading, and it also has a section on model-specific cross-validation techniques.
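
A minimal GridSearchCV sketch (tuning only the regularization strength C, and scoring with neg_log_loss since the probabilities are what matter here; the grid values are just an illustration):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
    search = GridSearchCV(LogisticRegression(), param_grid,
                          scoring='neg_log_loss', cv=10)
    search.fit(X_train, Y_train)

    print(search.best_params_, search.best_score_)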

One more thing you can do: instead of using accuracy as the metric for tuning your classifier, you can focus on recall and precision (or the true positive rate in your case). You will need to use make_scorer in sklearn; examples are in the scikit-learn documentation. You might also want to check out the F1-score or the F-beta score.
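
A minimal make_scorer sketch (scoring by recall on the positive class, and by an F-beta score with beta=2, which weights recall more heavily than precision):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import make_scorer, recall_score, fbeta_score
    from sklearn.model_selection import cross_val_score

    # Score models by recall on the positive (success) class
    recall_scorer = make_scorer(recall_score, pos_label=1)

    # Or emphasise recall over precision with an F-beta score
    f2_scorer = make_scorer(fbeta_score, beta=2, pos_label=1)

    print(cross_val_score(LogisticRegression(), X, Y, scoring=recall_scorer, cv=10))
    print(cross_val_score(LogisticRegression(), X, Y, scoring=f2_scorer, cv=10))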

You can also check out this GitHub repository for various sampling techniques to tackle the class imbalance problem in sklearn.
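
As a simple baseline along those lines, here is a minimal random-oversampling sketch using only sklearn.utils.resample (packages such as imbalanced-learn provide more sophisticated methods like SMOTE); it assumes X_train and Y_train are numpy arrays from the question's split:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    # Oversample the minority (success) class in the training set only
    X_pos, Y_pos = X_train[Y_train == 1], Y_train[Y_train == 1]
    X_neg, Y_neg = X_train[Y_train == 0], Y_train[Y_train == 0]

    X_pos_up, Y_pos_up = resample(X_pos, Y_pos, replace=True,
                                  n_samples=len(Y_neg), random_state=0)

    X_bal = np.vstack([X_neg, X_pos_up])
    Y_bal = np.concatenate([Y_neg, Y_pos_up])

    model = LogisticRegression().fit(X_bal, Y_bal)

As with class weighting, oversampling changes the implied base rate, so the raw predicted probabilities for the minority class will be inflated.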

You can also check out this answer for more techniques.