Scikit-learn - Bad input shape error on multinomial logistic regression

Question

I'm implementing a multinomial logistic regression model in Python using Scikit-learn. Here's my code:

X = pd.concat([each for each in feature_cols], axis=1)
y = train[["<5", "5-6", "6-7", "7-8", "8-9", "9-10"]]
lm = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lm.fit(X, y)

However, I get ValueError: bad input shape (50184, 6) when the last line executes.

X is a DataFrame with 50184 rows and 7 columns. y also has 50184 rows, but 6 columns.

I ultimately want to predict in what bin (<5, 5-6, etc.) the outcome falls. All the independent and dependent variables used in this case are dummy columns which have a binary value of either 0 or 1. What am I missing?

The docs ask for a vector input for Y. Perhaps you should try coding your training variables as different values instead of dummies? — Stefan

Stefan · Accepted Answer · 2015-12-01T04:10:33

The Logistic Regression 3-class Classifier example shows that LogisticRegression is fitted with a one-dimensional target vector rather than a matrix. In that example the target is the class variable of the iris dataset, coded as the values [0, 1, 2].
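As a minimal sketch of what that example relies on (this is not the documentation example itself, and max_iter=1000 is an assumption added here so the solver converges), the iris target is a 1-D array with one integer label per row:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# X is a 2-D feature matrix; y is a 1-D vector of class labels.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)

# max_iter is raised here (an assumption of this sketch) to avoid
# a convergence warning with the lbfgs solver.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X, y)

print(clf.classes_)  # [0 1 2]
```

Passing a 2-D dummy matrix as y is what triggers the "bad input shape" error in the question.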

To convert the dummy matrix to a Series, you could multiply each column by a different integer and then, assuming it's a pandas.DataFrame, call .sum(axis=1) on the result. Something like:

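# Explicit copy: avoids a possible SettingWithCopyWarning when y is a slice of train
y = y.copy()
for i, col in enumerate(y.columns.tolist(), 1):
    y.loc[:, col] *= i  # the i-th dummy column now holds i wherever it was 1
y = y.sum(axis=1)  # one integer label (1 to 6) per row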
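An alternative to the multiply-and-sum approach, sketched here under the assumption that every row has exactly one dummy set to 1, is to let pandas pick the column holding the 1. The toy frame y_dummies below is hypothetical data standing in for train[[...]] from the question:

```python
import pandas as pd

bins = ["<5", "5-6", "6-7", "7-8", "8-9", "9-10"]

# Hypothetical toy data standing in for the asker's dummy columns.
y_dummies = pd.DataFrame(
    [
        [1, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 1, 0, 0, 0, 0],
    ],
    columns=bins,
)

# Sanity check: the conversion only makes sense for one-hot rows.
assert (y_dummies.sum(axis=1) == 1).all()

# Column name of the 1 in each row, giving string labels.
y_labels = y_dummies.idxmax(axis=1)
print(y_labels.tolist())  # ['<5', '6-7', '9-10', '5-6']

# Position of the 1 in each row, giving integer codes 0 to 5.
y_codes = y_dummies.to_numpy().argmax(axis=1)
print(y_codes.tolist())  # [0, 2, 5, 1]
```

Either result is one-dimensional, so it can be passed as the target, for example lm.fit(X, y_labels). With string labels, the fitted model's predictions come back as bin names directly.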
