
I have a dataset of 92k observations and am trying to fit a logistic regression model with sklearn's LogisticRegression(). It performs poorly, with an AUC of about .51, barely above the .50 baseline. Oddly, a logistic regression fitted with the statsmodels Logit() class reaches an AUC of .68. Both use regularization, the two predictors are numerical, and the outcome is binary. I have previously gotten sklearn and statsmodels to closely match on performance metrics and coefficients, but I can't figure out why sklearn underperforms now.

I have tried rerunning multiple times and restarting the kernel, with the same result. Each model runs in a single JupyterLab code block. How do I get sklearn to match the performance of my statsmodels model?

Sklearn Model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(df_model.drop("6MonthOutcome", axis=1), df_model['6MonthOutcome'], test_size=.2)

logit_model = LogisticRegression(max_iter=1000)
result = logit_model.fit(X_train, y_train)
y_pred = result.predict(X_test)

from sklearn.metrics import (confusion_matrix, accuracy_score)
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
print(metrics.auc(fpr, tpr))

>>> Output: 0.5050369815416016

Statsmodels Model:

from sklearn.model_selection import train_test_split
import statsmodels.api as sm  # provides sm.add_constant and sm.Logit used below

X_train, X_test, y_train, y_test = train_test_split(df_model.drop("6MonthOutcome", axis=1), df_model['6MonthOutcome'], test_size=.2)
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

logit_model = sm.Logit(y_train, X_train)
# maxiter is an argument of the fitting method, not of the Logit constructor
result = logit_model.fit_regularized(maxiter=1000)
y_pred = result.predict(X_test)

from sklearn.metrics import (confusion_matrix, accuracy_score)
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, pos_label=1)
print(metrics.auc(fpr, tpr))

>>> Output: 0.6813991995101205
Not super helpful but thanks I guess, I figured out the specific solution - Paul M

1 Answer


The sklearn model scores at the baseline because LogisticRegression's predict() returns hard class labels (0 or 1), whereas the statsmodels predict() returns probabilities. roc_curve needs a continuous score to sweep thresholds over. Given only 0/1 labels it has a single usable threshold, so the ROC curve collapses to one point joined to the corners. When the model predicts the majority class for nearly every row, that point sits close to the diagonal, which gives an AUC near 0.5, no better than random guessing.
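To make the effect concrete, here is a minimal pure-Python sketch. The auc helper and the toy numbers are hypothetical, made up for illustration and not taken from the dataset in the question. It computes AUC as the probability that a random positive is ranked above a random negative, which matches the area under the ROC curve, and compares scoring with probabilities against scoring with labels thresholded at 0.5.

```python
def auc(y_true, scores):
    """AUC as P(random positive outranks random negative); ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: a rare positive class and a model whose predicted
# probabilities never reach 0.5, so every hard label comes out as 0.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.10, 0.15, 0.20, 0.30, 0.12, 0.25, 0.28, 0.35]
labels = [int(p >= 0.5) for p in probs]

auc_probs = auc(y_true, probs)    # ranking information is preserved
auc_labels = auc(y_true, labels)  # every pair is a tie

print(auc_probs, auc_labels)
```

Here the probabilities rank most positives above the negatives, while the thresholded labels are all identical, so every positive/negative pair is a tie and the AUC is exactly 0.5.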

In my example, statsmodels' result.predict(X_test) is equivalent to sklearn's logit_model.predict_proba(X_test)[:, 1], so the latter is what should be passed to roc_curve.
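A sketch of that fix on the sklearn side follows. Since df_model is not available here, it uses synthetic data as a stand-in (an assumption: two numeric predictors, a weak signal, and a rare positive class). The variable names and the generated data are illustrative only, and roc_auc_score is used as a shortcut for roc_curve followed by auc.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_model (assumption, not the asker's data):
# two numeric predictors and an imbalanced binary outcome.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
logits = -2.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict() returns hard 0/1 labels; predict_proba()[:, 1] returns the
# probability of the positive class, which is what ROC/AUC needs.
hard_labels = model.predict(X_test)
pos_probs = model.predict_proba(X_test)[:, 1]

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_probs = roc_auc_score(y_test, pos_probs)

print("AUC from predict():      ", auc_from_labels)
print("AUC from predict_proba():", auc_from_probs)
```

With the original code, the equivalent change would be to replace y_pred = result.predict(X_test) with y_pred = result.predict_proba(X_test)[:, 1] before calling metrics.roc_curve.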