import pandas as pd

file = pd.DataFrame({'name': ['s', 'k', 'lo', 'ki'], 'age': [12, 23, 32, 22], 'marks': [34, 34, 43, 22], 'score': [1, 1, 0, 1]})

I would like to run a logistic regression with the command:

import statsmodels.formula.api as smf

logit = smf.logit('score ~ age + marks', file)
results = logit.fit()

But I get an error:

"statsmodels.tools.sm_exceptions.PerfectSeparationError:
Perfect separation detected, results not available". 

I would also like to split the data into a train set and a test set. How can I do that? I need to use the predict command after this.

The "glm" command in R looks much easier than this in Python.


1 Answer


I came across a similar error when I was working with some data. It is a property of your data: since the two groups (score=0 and score=1) are perfectly separated, the decision boundary can lie anywhere between them, so there are infinitely many solutions and the library cannot return a single one. [Figure: the data points with several candidate decision boundaries; solutions 1, 2, and 3 are all equally valid.]

I ran this using glmnet in MATLAB. The warning from MATLAB reads:

Warning: The estimated coefficients perfectly separate failures from successes. This means the theoretical best estimates are not finite.

Using more data points, so that the two classes overlap, will help.
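To illustrate, here is a sketch with two made-up extra rows (ages 31 and 13 are my invention, not from your data) chosen so that the classes overlap; with the separation broken, statsmodels fits without raising:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# The original four rows plus two hypothetical ones that make the classes overlap
file = pd.DataFrame({
    'age':   [12, 23, 32, 22, 31, 13],
    'marks': [34, 34, 43, 22, 44, 33],
    'score': [ 1,  1,  0,  1,  0,  0],
})

# With overlapping classes the maximum-likelihood estimates are finite,
# so logit.fit() no longer raises PerfectSeparationError
results = smf.logit('score ~ age + marks', file).fit(disp=0)
print(results.params)  # finite coefficients
```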

Interestingly, LogisticRegression from scikit-learn works without complaint: it applies L2 regularization by default (controlled by the C parameter), which keeps the coefficients finite even when the classes are perfectly separated.

Example code using scikit-learn for your problem is:

import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

file = pd.DataFrame({'name': ['s', 'k', 'lo', 'ki'], 'age': [12, 23, 32, 22], 'marks': [34, 34, 43, 22], 'score': [1, 1, 0, 1]})

# Build the design matrices from the same formula you used in statsmodels
y, X = dmatrices('score ~ age + marks', file)
y = np.ravel(y)  # flatten the (n, 1) outcome column into a 1-D array

# Fit the logistic regression model
model = LogisticRegression()
model = model.fit(X, y)

# After fitting you can call predict, e.g. on the training data
print(model.predict(X))

For splitting data into training and testing sets, use train_test_split (in current scikit-learn versions it lives in sklearn.model_selection rather than sklearn.cross_validation): http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
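A minimal sketch of the split-then-predict workflow, using a slightly larger made-up dataset (the extra rows are my invention) so that both classes appear on each side of the split:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: your four rows plus four made-up ones
df = pd.DataFrame({
    'age':   [12, 23, 32, 22, 31, 13, 25, 18],
    'marks': [34, 34, 43, 22, 44, 33, 40, 28],
    'score': [ 1,  1,  0,  1,  0,  0,  0,  1],
})

X = df[['age', 'marks']]  # scikit-learn adds its own intercept, so no patsy needed here
y = df['score']

# Hold out 25% of the rows for testing; stratify keeps the class balance,
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(X_test)     # the predict step you asked about
accuracy = model.score(X_test, y_test)  # fraction of correct test predictions
```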