1 vote

I am totally new to machine learning and I'm trying to use scikit-learn to make a simple logistic regression model with 1 input variable (X) and a binary outcome (Y). My data consists of 325 samples, with 39 successes and 286 failures. The data was split into a training set (70%) and a test set (30%).

My goal is actually to obtain the predicted probabilities of success for any given X based on my data, not for classification prediction per se. That is, I will be taking the predicted probabilities for use in a separate model I'm building and won't be using the logistic regression as a classifier at all. So it's important that the predicted probabilities actually fit the data.
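
For context, this is roughly how I am fitting the model and getting the probabilities (a minimal sketch; the variable names are mine and all settings are the scikit-learn defaults):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # X has shape (325, 1); Y is 0/1 with 39 ones and 286 zeros
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.30, random_state=0)

    model = LogisticRegression()
    model.fit(X_train, Y_train)

    predicted = model.predict(X_test)    # hard 0/1 predictions
    probs = model.predict_proba(X_test)  # probs[:, 1] = P(Y = 1 | X)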

However, I am having some trouble understanding whether or not my model is a good fit to the data, or if the computed probabilities are actually accurate.

I am getting the following metrics:

  • Classification accuracy: metrics.accuracy_score(Y_test, predicted) = 0.92. My understanding of this metric is that the model has a high chance of making correct predictions, so it looks to me like the model is a good fit.

  • Log loss: cross_val_score(LogisticRegression(), X, Y, scoring='neg_log_loss', cv=10) = -0.26. This is probably the most confusing metric for me, and apparently the most important, as it measures the accuracy of the predicted probabilities. I know that the closer to zero the score is, the better, but how close is close enough?

  • AUC: metrics.roc_auc_score(Y_test, probs[:, 1]) = 0.9. Again, this looks good, since the closer the ROC score is to 1 the better.

  • Confusion Matrix: metrics.confusion_matrix(Y_test, predicted) =

        [[88,  0],
         [ 8,  2]]
    

    My understanding here is that the diagonal gives the numbers of correct predictions on the test set, so this looks ok.

  • Report: metrics.classification_report(Y_test, predicted) =

                     precision    recall  f1-score   support

                0.0       0.92      1.00      0.96        88
                1.0       1.00      0.20      0.33        10

        avg / total       0.93      0.92      0.89        98
    

    According to this classification report, the model has good precision, so it looks like a good fit. However, I am not sure how to interpret the recall, or whether this report is bad news for my model. The sklearn documentation states that recall is a model's ability to find all the positive samples, so a score of 0.2 for class 1 would mean that it only finds the positives 20% of the time? That sounds like a really bad fit to the data.

I'd really appreciate it if someone could clarify whether I am interpreting these metrics the right way, and perhaps shed some light on whether my model is good or bogus. Also, if there are any other tests I could do to determine whether the computed probabilities are accurate, please let me know.
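
One check I was considering for the probabilities themselves is a calibration curve together with the Brier score (a minimal sketch below using sklearn.calibration.calibration_curve; with only about 10 positives in the test set, I assume the bins will be quite noisy):

    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss

    # probs[:, 1] are the predicted probabilities of success on the test set
    frac_pos, mean_pred = calibration_curve(Y_test, probs[:, 1], n_bins=5)

    # For well-calibrated probabilities, the observed frequency of success in
    # each bin should be close to the mean predicted probability in that bin
    for mp, fp in zip(mean_pred, frac_pos):
        print("mean predicted %.2f -> observed frequency %.2f" % (mp, fp))

    # Brier score: mean squared error of the probabilities (0 is perfect)
    print("Brier score:", brier_score_loss(Y_test, probs[:, 1]))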

If these aren't good metric scores, I'd really appreciate some direction on where to go next in terms of improvement.

Thanks!!


2 Answers

5 votes

Your data set is unbalanced, since there are far more failures than successes. A classifier that just guesses failure all the time would get about 90% accuracy on your test set (88 of the 98 samples are failures), so 92% accuracy isn't that impressive.

The confusion matrix shows what's happening: 88 times it correctly predicts failure, and 8 times it incorrectly predicts failure when the true outcome was success. Only twice does it predict success correctly.

Accuracy is the fraction of guesses that are correct: (88 + 2)/98 ≈ 0.92, or 92% overall. The recall for success is only 2 out of the (8 + 2) = 10 actual successes, or 20%.

So the model isn't a great fit. There are many ways to deal with unbalanced data sets, such as weighting the examples or applying a prior to the predictions. The confusion matrix is a good way to see what's really happening.
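
For example, a minimal sketch of the weighting approach in scikit-learn (assuming the same X_train/Y_train split as in the question):

    from sklearn.linear_model import LogisticRegression

    # class_weight='balanced' weights each class inversely to its frequency,
    # so the rare successes count as much in the fit as the many failures
    weighted = LogisticRegression(class_weight='balanced')
    weighted.fit(X_train, Y_train)

    probs_weighted = weighted.predict_proba(X_test)[:, 1]

Note that reweighting effectively changes the prior, so the predicted probabilities for the rare class get pushed upward; since you want the probabilities themselves, you may need to adjust them back toward the true base rate afterwards.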

2 votes

Your data suffers from a class imbalance problem, and you have not specified any way of dealing with it while training your classifier. Even though your accuracy is high, that may simply be because the failure class is so large that it dominates your test set as well.

To deal with it, you can use a stratified split in sklearn to shuffle and split your data so that both the training and test sets keep the same class proportions.
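
For instance, a minimal sketch (assuming the same X and Y as in the question; train_test_split takes a stratify argument, and StratifiedKFold does the same for cross-validation):

    from sklearn.model_selection import StratifiedKFold, train_test_split

    # Hold-out split that keeps the ~12% success rate in both halves
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.30, stratify=Y, random_state=0)

    # Stratified 10-fold cross-validation: each fold keeps the class proportions
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, Y):
        pass  # fit on X[train_idx], evaluate on X[test_idx]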

You can also try other techniques to improve your classifier, such as a grid search over hyperparameters with GridSearchCV. The scikit-learn documentation on model evaluation is worth reading, and it also has a section on model-specific cross-validation techniques.
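
A minimal GridSearchCV sketch (tuning only the regularization strength C, and scoring with neg_log_loss since the probabilities are what matter here; the grid values are just an illustration):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
    search = GridSearchCV(LogisticRegression(), param_grid,
                          scoring='neg_log_loss', cv=10)
    search.fit(X_train, Y_train)

    print(search.best_params_, search.best_score_)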

One more thing you can do: instead of using accuracy as the metric for tuning your classifier, you can focus on recall and precision (or the true positive rate in your case). You will need to use make_scorer in sklearn; examples are in the scikit-learn documentation. You might also want to check out the F1-score or the F-beta score.
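
A minimal make_scorer sketch (scoring by recall on the positive class, and by an F-beta score with beta=2, which weights recall more heavily than precision):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import make_scorer, recall_score, fbeta_score
    from sklearn.model_selection import cross_val_score

    # Score models by recall on the positive (success) class
    recall_scorer = make_scorer(recall_score, pos_label=1)

    # Or emphasise recall over precision with an F-beta score
    f2_scorer = make_scorer(fbeta_score, beta=2, pos_label=1)

    print(cross_val_score(LogisticRegression(), X, Y, scoring=recall_scorer, cv=10))
    print(cross_val_score(LogisticRegression(), X, Y, scoring=f2_scorer, cv=10))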

You can also check out this GitHub repository for various sampling techniques to tackle the class imbalance problem in sklearn.
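
As a simple baseline along those lines, here is a minimal random-oversampling sketch using only sklearn.utils.resample (packages such as imbalanced-learn provide more sophisticated methods like SMOTE); it assumes X_train and Y_train are numpy arrays from the question's split:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    # Oversample the minority (success) class in the training set only
    X_pos, Y_pos = X_train[Y_train == 1], Y_train[Y_train == 1]
    X_neg, Y_neg = X_train[Y_train == 0], Y_train[Y_train == 0]

    X_pos_up, Y_pos_up = resample(X_pos, Y_pos, replace=True,
                                  n_samples=len(Y_neg), random_state=0)

    X_bal = np.vstack([X_neg, X_pos_up])
    Y_bal = np.concatenate([Y_neg, Y_pos_up])

    model = LogisticRegression().fit(X_bal, Y_bal)

As with class weighting, oversampling changes the implied base rate, so the raw predicted probabilities for the minority class will be inflated.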

You can also check out this answer for more techniques.