I am trying to fit a logistic regression using the logit function in statsmodels. The dataset has a boolean target variable ('default') and two features ('fico_interp', 'home_ownership_int'). All three columns come from the same data frame, 'traindf':

from sklearn import datasets
import statsmodels.formula.api as smf

lmf = smf.logit('default ~ fico_interp + home_ownership_int',traindf).fit()

This generates an error message:

ValueError: operands could not be broadcast together with shapes (40406,2) (40406,)

How can this happen?

One of the columns fico_interp or home_ownership_int is an (x, 2) array. Try inspecting them. - farhawa
My guess is that the boolean target variable doesn't work. Try converting it to int. patsy treats a boolean as a categorical variable and converts it to a two-dimensional response variable, which doesn't work for Logit. There should already be an open issue for this in statsmodels, but there is no solution yet. - Josef
@wajdi Hi Wajdi - that doesn't appear to solve the problem. home_ownership_int is indeed a categorical variable, but when I substitute a continuous variable, I get the same error message. I also note that each variable has dtype 'object' and the same dimensions, (40407,). - GPB

1 Answer

The problem is that traindf['default'] contains values that are not numeric.

The following code reproduces the error:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
# C holds the strings '1' and '0' rather than numbers
df['C'] = ((df['B'] > 0) * 1).apply(str)
lmf = smf.logit('C ~ A', df).fit()  # fails: C is not numeric

And the following code is a possible way to fix this instance:

# Map the strings to integers, then cast so the column has a numeric dtype
df['C'] = df['C'].replace({'1': 1, '0': 0}).astype(int)
lmf = smf.logit('C ~ A', df).fit()
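
The same idea can be applied to the data frame in the question. The following is a minimal sketch, assuming the target is stored either as real booleans or as '0'/'1' strings. The helper name coerce_binary_target and the toy frame are hypothetical and are not part of pandas or statsmodels.

```python
import pandas as pd


def coerce_binary_target(series):
    """Return a 0/1 integer Series (hypothetical helper, not a library function)."""
    if pd.api.types.is_bool_dtype(series):
        # True/False -> 1/0
        return series.astype(int)
    # Strings such as '0'/'1' -> integers; raises ValueError on other text
    return pd.to_numeric(series)


# Toy stand-in for traindf; only the column name 'default' comes from the question
traindf = pd.DataFrame({"default": ["1", "0", "1"]})
print(traindf.dtypes)  # 'default' is not a numeric dtype here

traindf["default"] = coerce_binary_target(traindf["default"])
print(traindf.dtypes)  # 'default' is now an integer dtype
```

After the conversion, the smf.logit call from the question can be rerun unchanged. The feature columns, which the asker reports as dtype 'object', may need the same pd.to_numeric treatment.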

This post reports an analogous issue.