I'm using Scikit-Learn's Logistic Regression algorithm to perform digit classification. The dataset I'm using is Scikit-Learn's load_digits.
Below is a simplified version of my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import learning_curve
from sklearn.datasets import load_digits
digits = load_digits()
model = LogisticRegression(solver='lbfgs',
                           penalty=None,  # no penalty: an unregularized model
                           max_iter=100000)
model.fit(digits.data, digits.target)
predictions = model.predict(digits.data)
df_cm = pd.DataFrame(confusion_matrix(digits.target, predictions))
ax = sns.heatmap(df_cm, annot=True, cbar=False, cmap='Blues_r', fmt='d', annot_kws={"size": 10})
ax.set_ylim(0, 10)  # work around matplotlib 3.1.1 cropping the top and bottom heatmap rows
plt.title("Confusion Matrix")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
train_size = [0.2, 0.4, 0.6, 0.8, 1.0]
training_size, training_score, validation_score = learning_curve(
    model, digits.data, digits.target, cv=5,
    train_sizes=train_size, scoring='neg_mean_squared_error')
# learning_curve returns the negated MSE, so flip the sign to get the error
training_scores_mean = -training_score.mean(axis=1)
validation_score_mean = -validation_score.mean(axis=1)
plt.plot(training_size, validation_score_mean)
plt.plot(training_size, training_scores_mean)
plt.legend(["Validation error", "Training error"])
plt.ylabel("MSE")
plt.xlabel("Training set size")
plt.show()
### EDIT ###
# With L2 regularization
model = LogisticRegression(solver='lbfgs',
                           penalty='l2',  # changing the penalty to L2
                           max_iter=100000)
model.fit(digits.data, digits.target)
predictions = model.predict(digits.data)
df_cm = pd.DataFrame(confusion_matrix(digits.target, predictions))
ax = sns.heatmap(df_cm, annot=True, cbar=False, cmap='Blues_r', fmt='d', annot_kws={"size": 10})
ax.set_ylim(0, 10)
plt.title("Confusion Matrix with L2 regularization")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
training_size, training_score, validation_score = learning_curve(
    model, digits.data, digits.target, cv=5,
    train_sizes=train_size, scoring='neg_mean_squared_error')
training_scores_mean = -training_score.mean(axis=1)
validation_score_mean = -validation_score.mean(axis=1)
plt.plot(training_size, validation_score_mean)
plt.plot(training_size, training_scores_mean)
plt.legend(["Validation error", "Training error"])
plt.title("Learning curve with L2 regularization")
plt.ylabel("MSE")
plt.xlabel("Training set size")
plt.show()
# With L2 regularization and best C
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10]}  # grid of inverse regularization strengths to try
model_l2 = GridSearchCV(LogisticRegression(random_state=0, solver='lbfgs', penalty='l2', max_iter=100000),
                        param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
model_l2.fit(digits.data, digits.target)
best_C = model_l2.best_params_.get("C")
print(best_C)
model_reg = LogisticRegression(solver='lbfgs',
                               penalty='l2',
                               C=best_C,
                               max_iter=100000)
model_reg.fit(digits.data, digits.target)
predictions = model_reg.predict(digits.data)
df_cm = pd.DataFrame(confusion_matrix(digits.target, predictions))
ax = sns.heatmap(df_cm, annot=True, cbar=False, cmap='Blues_r', fmt='d', annot_kws={"size": 10})
ax.set_ylim(0, 10)
plt.title("Confusion Matrix with L2 regularization and best C")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
training_size, training_score, validation_score = learning_curve(
    model_reg, digits.data, digits.target, cv=5,
    train_sizes=train_size, scoring='neg_mean_squared_error')
training_scores_mean = -training_score.mean(axis=1)
validation_score_mean = -validation_score.mean(axis=1)
plt.plot(training_size, validation_score_mean)
plt.plot(training_size, training_scores_mean)
plt.legend(["Validation error", "Training error"])
plt.title("Learning curve with L2 regularization and best C")
plt.ylabel("MSE")
plt.xlabel("Training set size")
plt.show()
As can be seen from the confusion matrix for the training data and from the last plot generated using learning_curve, the error on the training set is always 0.
It seems to me that the model is massively overfitting, and I can't make sense of it. I've tried this with the MNIST dataset as well, and the same thing happens.
How can I solve this?
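For reference, this is roughly how I would score the model on a held-out split instead of on the training data itself (a minimal sketch; the 25% test fraction and random_state are arbitrary choices):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 25% of the samples so the model is scored on data it has never seen
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))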
-- EDIT --
I added the code above for L2 regularization, and for L2 regularization with the best value of the hyperparameter C.
With L2 regularization, the model still overfits the data:
[Learning curve with L2 regularization]
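Note that C in scikit-learn is the inverse of the regularization strength, so the default C = 1.0 applies only a mild penalty; a smaller value regularizes harder, e.g. (hypothetical value):
model_strong = LogisticRegression(solver='lbfgs',
                                  penalty='l2',
                                  C=0.01,  # hypothetical: a much stronger L2 penalty than the default C=1.0
                                  max_iter=100000)
model_strong.fit(digits.data, digits.target)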
With the best C hyperparameter the error on the training data is no longer zero, but the algorithm still overfits:
[Learning curve with L2 regularization and best C]
I still don't understand what's happening...
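For completeness, the train/validation gap can also be quantified directly by comparing in-sample accuracy with cross-validated accuracy (a minimal sketch reusing the model_reg fitted above):
from sklearn.model_selection import cross_val_score

# Compare accuracy on the training data with 5-fold cross-validated accuracy;
# a large gap between the two numbers is the overfitting described above
cv_scores = cross_val_score(model_reg, digits.data, digits.target, cv=5)
print("Training accuracy:", model_reg.score(digits.data, digits.target))
print("Mean CV accuracy:", cv_scores.mean())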