I'm using Scikit-Learn's Logistic Regression algorithm to perform digit classification. The dataset I'm using is Scikit-Learn's load_digits.
Below is a simplified version of my code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import learning_curve
from sklearn.datasets import load_digits
digits = load_digits()
model = LogisticRegression(solver='lbfgs',
                           penalty=None,  # no penalty: an unregularized model
                           max_iter=100000)
model.fit(digits.data, digits.target)
predictions = model.predict(digits.data)
df_cm = pd.DataFrame(confusion_matrix(digits.target, predictions))
ax = sns.heatmap(df_cm, annot=True, cbar=False, cmap='Blues_r', fmt='d', annot_kws={"size": 10})
ax.set_ylim(0, 10)  # work around matplotlib 3.1.1 cropping the top and bottom heatmap rows
plt.title("Confusion Matrix")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
train_size = [0.2, 0.4, 0.6, 0.8, 1.0]
training_size, training_score, validation_score = learning_curve(
    model, digits.data, digits.target, cv=5,
    train_sizes=train_size, scoring='neg_mean_squared_error')
# learning_curve returns the negated MSE, so flip the sign to get the error
training_scores_mean = -training_score.mean(axis=1)
validation_score_mean = -validation_score.mean(axis=1)
plt.plot(training_size, validation_score_mean)
plt.plot(training_size, training_scores_mean)
plt.legend(["Validation error", "Training error"])
plt.ylabel("MSE")
plt.xlabel("Training set size")
plt.show()
### EDIT ###
# With L2 regularization
model = LogisticRegression(solver='lbfgs',
                           penalty='l2',  # changing the penalty to L2
                           max_iter=100000)
model.fit(digits.data, digits.target)
predictions = model.predict(digits.data)
df_cm = pd.DataFrame(confusion_matrix(digits.target, predictions))
ax = sns.heatmap(df_cm, annot=True, cbar=False, cmap='Blues_r', fmt='d', annot_kws={"size": 10})
ax.set_ylim(0, 10)
plt.title("Confusion Matrix with L2 regularization")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
training_size, training_score, validation_score = learning_curve(
    model, digits.data, digits.target, cv=5,
    train_sizes=train_size, scoring='neg_mean_squared_error')
training_scores_mean = -training_score.mean(axis=1)
validation_score_mean = -validation_score.mean(axis=1)
plt.plot(training_size, validation_score_mean)
plt.plot(training_size, training_scores_mean)
plt.legend(["Validation error", "Training error"])
plt.title("Learning curve with L2 regularization")
plt.ylabel("MSE")
plt.xlabel("Training set size")
plt.show()
# With L2 regularization and best C
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10]}  # grid of inverse regularization strengths to try
model_l2 = GridSearchCV(LogisticRegression(random_state=0, solver='lbfgs', penalty='l2', max_iter=100000),
                        param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
model_l2.fit(digits.data, digits.target)
best_C = model_l2.best_params_.get("C")
print(best_C)
model_reg = LogisticRegression(solver='lbfgs',
                               penalty='l2',
                               C=best_C,
                               max_iter=100000)
model_reg.fit(digits.data, digits.target)
predictions = model_reg.predict(digits.data)
df_cm = pd.DataFrame(confusion_matrix(digits.target, predictions))
ax = sns.heatmap(df_cm, annot=True, cbar=False, cmap='Blues_r', fmt='d', annot_kws={"size": 10})
ax.set_ylim(0, 10)
plt.title("Confusion Matrix with L2 regularization and best C")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
training_size, training_score, validation_score = learning_curve(
    model_reg, digits.data, digits.target, cv=5,
    train_sizes=train_size, scoring='neg_mean_squared_error')
training_scores_mean = -training_score.mean(axis=1)
validation_score_mean = -validation_score.mean(axis=1)
plt.plot(training_size, validation_score_mean)
plt.plot(training_size, training_scores_mean)
plt.legend(["Validation error", "Training error"])
plt.title("Learning curve with L2 regularization and best C")
plt.ylabel("MSE")
plt.xlabel("Training set size")
plt.show()
As can be seen from the confusion matrix for the training data and from the last plot generated using learning_curve, the error on the training set is always 0.
It seems to me that the model is massively overfitting, and I can't make sense of it. I've tried this with the MNIST dataset as well, and the same thing happens.
How can I solve this?
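For reference, this is roughly how I would score the model on a held-out split instead of on the training data itself (a minimal sketch; the 25% test fraction and random_state are arbitrary choices):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 25% of the samples so the model is scored on data it has never seen
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print("Train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))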
-- EDIT --
I added the code above for L2 regularization, and for L2 regularization with the best value of the hyperparameter C.
With L2 regularization, the model still overfits the data:
[Learning curve with L2 regularization]
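Note that C in scikit-learn is the inverse of the regularization strength, so the default C = 1.0 applies only a mild penalty; a smaller value regularizes harder, e.g. (hypothetical value):
model_strong = LogisticRegression(solver='lbfgs',
                                  penalty='l2',
                                  C=0.01,  # hypothetical: a much stronger L2 penalty than the default C=1.0
                                  max_iter=100000)
model_strong.fit(digits.data, digits.target)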
With the best C hyperparameter the error on the training data is no longer zero, but the algorithm still overfits:
[Learning curve with L2 regularization and best C]
I still don't understand what's happening...
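For completeness, the train/validation gap can also be quantified directly by comparing in-sample accuracy with cross-validated accuracy (a minimal sketch reusing the model_reg fitted above):
from sklearn.model_selection import cross_val_score

# Compare accuracy on the training data with 5-fold cross-validated accuracy;
# a large gap between the two numbers is the overfitting described above
cv_scores = cross_val_score(model_reg, digits.data, digits.target, cv=5)
print("Training accuracy:", model_reg.score(digits.data, digits.target))
print("Mean CV accuracy:", cv_scores.mean())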