
I have a model pipeline like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# define preprocessor
preprocess = make_column_transformer(
    (StandardScaler(), ['attr1', 'attr2', 'attr3', 'attr4', 'attr5', 
                        'attr6', 'attr7', 'attr8', 'attr9']),
    (OneHotEncoder(categories='auto'), ['attrcat1', 'attrcat2'])
)

# define train and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0)

When I execute the pipeline without over-sampling I get:

# don't do over-sampling in this case
os_X_train = X_train
os_y_train = y_train

print('Training data is type %s and shape %s' % (type(os_X_train), os_X_train.shape))
logreg = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model = make_pipeline(preprocess, logreg)
model.fit(os_X_train, np.ravel(os_y_train))
print("The coefficients shape is: %s" % logreg.coef_.shape)
print("Model coefficients: ", logreg.intercept_, logreg.coef_)
print("Logistic Regression score: %f" % model.score(X_test, y_test))

The output is:

Training data is type <class 'pandas.core.frame.DataFrame'> and shape (87145, 11)
The coefficients shape is: (1, 47)
Model coefficients:  [-7.51822124] [[ 0.10011794  0.10313989 ... -0.14138371  0.01612046  0.12064405]]
Logistic Regression score: 0.999116

Meaning I get 47 model coefficients for a training set of 87145 samples, which makes sense given the preprocessing defined above: the OneHotEncoder works on attrcat1 and attrcat2, which have 31 and 7 categories respectively, adding 38 columns; together with the 9 numeric columns I already had, that makes 47 features.
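
To double-check that count, the preprocessor can be fitted on its own and its output shape inspected (a quick sketch reusing the objects above):

# fit only the preprocessor and count the columns it produces
Xt = preprocess.fit_transform(X_train)
print(Xt.shape)  # the second dimension should match the coefficient count, i.e. 47 here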

Now if I do the same but this time over-sampling using SMOTE like this:

from imblearn.over_sampling import SMOTE
# balance the classes by oversampling the training data
os = SMOTE(random_state=0)
os_X_train, os_y_train = os.fit_sample(X_train, y_train.ravel())
os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
os_y_train = pd.DataFrame(data=os_y_train, columns=['response'])

The output becomes:

Training data is type <class 'pandas.core.frame.DataFrame'> and shape (174146, 11)
The coefficients shape is: (1, 153024)
Model coefficients:  [12.02830778] [[ 0.42926969  0.14192505 -1.89354062 ...  0.008847    0.00884372 -8.15123962]]
Logistic Regression score: 0.997938

In this case I get roughly twice the training sample size, which balances the response classes as I wanted, but my logistic regression model explodes to 153024 coefficients. This doesn't make any sense to me ... any ideas why?

Hard to say what is happening here without the full code. Neither model makes sense from what I can see - you seem to have 11 features in each case, so you should have a maximum of 11 coefficients, unless you are taking feature interactions (which I don't think you are?). The most likely issue is that you're fitting the model with the wrong data (or the wrong transformation of the data) both times. Is it possible to post some dummy data and the full code for replicating the problem? - ajrwhite
@ajrwhite You are right, but the 47 coefficients do make sense: the OneHotEncoder works on attrcat1 and attrcat2, which have 31 + 7 categories in total, adding 38 columns; plus the 9 columns I already had, that makes 47 features, which is correct. What I can't see is why the over-sampling case comes up with such a number of coefficients. - SkyWalker
So the best way to debug this is to separate the preprocessing from the model fitting, rather than running as one pipeline. I think your preprocessing in the oversampled case is blowing up your feature space for some reason (probably you've transposed a matrix somewhere, giving you a p x n rather than n x p matrix, hence getting nearly n coefficients.) - ajrwhite
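
Following that suggestion, one way to see where the blow-up happens is to fit the preprocessor alone on the over-sampled frame and count the categories the encoder finds (a sketch reusing the names from the question; 'onehotencoder' is the name make_column_transformer auto-generates):

# fit only the preprocessor and inspect the fitted encoder
preprocess.fit(os_X_train)
enc = preprocess.named_transformers_['onehotencoder']
print([len(c) for c in enc.categories_])  # per-column category counts; huge numbers point at the encoder, not the model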

1 Answer


OK, I found the culprit. The issue is that SMOTE converts all of the feature columns to float (including the two categorical ones), and the synthetic samples it generates contain interpolated values for those columns. When the column transformer then applies OneHotEncoder to those float columns, the number of output columns explodes towards the number of samples, because nearly every interpolated float value is treated as a distinct category.
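
This is easy to confirm by comparing the dtypes and the number of distinct values in the categorical columns before and after resampling (a sketch using the column names from the question, assuming the categories were integer-coded to begin with):

# before resampling: integer category codes, a handful of distinct values
print(X_train[['attrcat1', 'attrcat2']].dtypes)
print(X_train['attrcat1'].nunique(), X_train['attrcat2'].nunique())

# after resampling: float columns with far more distinct (interpolated) values
print(os_X_train[['attrcat1', 'attrcat2']].dtypes)
print(os_X_train['attrcat1'].nunique(), os_X_train['attrcat2'].nunique())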

The solution was simply to convert those categorical columns back to int before running the pipeline:

# balance the classes by over-sampling the training data
os = SMOTE(random_state=0)
os_X_train, os_y_train = os.fit_sample(X_train, y_train.ravel())
os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
# critically important: convert the categorical variables from float back to int
os_X_train['attrcat1'] = os_X_train['attrcat1'].astype(int)
os_X_train['attrcat2'] = os_X_train['attrcat2'].astype(int)
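
With the dtypes restored, re-fitting the same pipeline brings the coefficient count back in line with the encoded feature width (a sketch reusing the objects from the question; the exact count can differ slightly if the truncated values hit codes not present in the original data):

# re-fit with the repaired dtypes; the coefficient count should be back near the
# 47 seen without over-sampling (9 scaled columns + the one-hot encoded categories)
model = make_pipeline(preprocess, logreg)
model.fit(os_X_train, np.ravel(os_y_train))
print(logreg.coef_.shape)

Note that astype(int) simply truncates the values SMOTE interpolated for those columns, so this is a workaround rather than a principled treatment of categorical features.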