
After building the classification model, I evaluated it by means of accuracy, precision and recall. To check for overfitting I used k-fold cross-validation. I am aware that if my model's scores vary greatly from my cross-validation scores, then my model is overfitting. However, I am stuck on how to define the threshold: how much difference in the scores actually implies that the model is overfitting? For example, here are 3 splits (3-fold CV, shuffle=True, random_state=42) and their respective scores for a Logistic Regression model:

Split Number  1
Accuracy= 0.9454545454545454
Precision= 0.94375
Recall= 1.0

Split Number  2
Accuracy= 0.9757575757575757
Precision= 0.9753086419753086
Recall= 1.0

Split Number  3
Accuracy= 0.9695121951219512
Precision= 0.9691358024691358
Recall= 1.0  

Direct training of the Logistic Regression model without CV:

Accuracy= 0.9530201342281879
Precision= 0.952054794520548
Recall= 1.0
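
(For reference, the per-split scores above come from a loop roughly like the following. This is a simplified sketch; X and y stand for my feature matrix and labels.)

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Same settings as described above: 3-fold CV, shuffle=True, random_state=42
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    print("Split Number ", i)
    print("Accuracy=", accuracy_score(y[test_idx], pred))
    print("Precision=", precision_score(y[test_idx], pred))
    print("Recall=", recall_score(y[test_idx], pred))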

So how do I decide by what magnitude my scores need to vary in order to infer an overfitting case?


1 Answer


I would assume that you are using cross-validation, which will split your data into training and test folds.

Right now you probably have something like this implemented:

from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

# Evaluate the classifier with several metrics across 5 folds
iris = datasets.load_iris()
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5)

So right now you are calculating only the test scores, which in all 3 splits are very good.

The first option is:

return_train_score is set to False by default to save computation time. To evaluate the scores on the training set as well, it needs to be set to True.

There you can also see the training scores of your folds. If the training accuracy is close to 1.0 while the test scores stay noticeably lower, that is a sign of overfitting.
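
A minimal sketch of that, continuing the snippet above (the dictionary keys follow scikit-learn's train_/test_ naming):

# Also compute training scores so train and test can be compared per fold
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)

# A large gap between train_* and test_* scores points to overfitting
print(scores['train_precision_macro'])
print(scores['test_precision_macro'])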

The other option is to run more splits. If every test score stays consistently high across the folds, you can be more confident that the model is not overfitting.
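
For example, a sketch with 10 folds instead of 3, looking at the mean and spread of the test scores:

from sklearn.model_selection import cross_val_score

# 10 folds give 10 independent test scores
test_scores = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='accuracy')
print(test_scores.mean(), test_scores.std())  # high mean, small spread -> stable model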

Did you add a baseline? I would assume this is binary classification, and I have the feeling the dataset is highly imbalanced, so 0.96 accuracy may not be that good in general, because a dummy classifier (always predicting one class) would already reach 0.95 accuracy.
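
A baseline like that is easy to get with scikit-learn's DummyClassifier (a sketch; X and y stand for your own data):

from sklearn.dummy import DummyClassifier

# Majority-class baseline: always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))  # the accuracy your real model has to beat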