2
votes

I am working on logistic regression using scikit-learn in Python. The data file can be downloaded via the following link.

link for data

Below is my code for the machine learning part.

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.3)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)

When I print the coefficients, they all turn out to be zero. Can anyone advise me on that?

Let me explain a little about my objective. The problem seems to be a classification problem, as we only see 0 or 1 in y_train and y_test. As a simple example, 0 can be considered a miss and 1 a score. What I am trying to do is compute the scoring probability for each event in which a shot is taken.

Thanks in advance.

Regards,

Zep

3
Hi Kumar, thanks for the reply. I attached the data file as well. Just click on the link to download it. - Zephyr
Oh I am sorry, I omitted that part. - Vivek Kumar
No problem. Appreciate your help. Thanks - Zephyr
I'm seeing a Lasso model being used instead of a logistic regression. Lasso is used for regression rather than classification. - Scratch'N'Purr
Hi Kumar, I am working on regression, not classification. Using the coefficients, I should be able to predict the probability of the outcome. Thanks - Zephyr

3 Answers

1
votes

I just changed alpha in Lasso: my result

1
votes

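Your Y variable contains only 0s and 1s. With alpha=0.3 the L1 penalty is evidently strong enough to shrink every coefficient to exactly zero; a much smaller alpha, as in the code below, leaves non-zero coefficients. If you still want to apply regression to this data, use GridSearchCV to try different values of alpha.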

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lasso = Lasso(alpha=.0009)
lasso.fit(X_train,y_train)
print("MC learning completed")
print(lasso.score(X_train,y_train))
print(lasso.score(X_test,y_test))
print(lasso.coef_)

Results

MC learning completed
0.37884924358295613
0.3806187071242917
[ 0.00078099  0.13397938 -0.00554932  0.00194722  0.00232949 -0.01100195
 -0.01363906  0.13031317 -0.00146605]
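To see why alpha=0.3 can wipe out every coefficient, it may help to compute the threshold above which Lasso returns an all-zero solution. This is only a sketch on synthetic stand-in data, since data.csv is not reproduced here: the make_classification call, the sample size, and the name alpha_max are assumptions for illustration, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv (assumption: 9 features, binary outcome)
X_raw, y = make_classification(n_samples=500, n_features=9,
                               n_informative=3, random_state=33)
X = StandardScaler().fit_transform(X_raw)  # columns now have zero mean
y = y.astype(float)

# Lasso minimizes (1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1.
# With an intercept and centered X, every coefficient is exactly zero
# once alpha reaches max|X^T (y - mean(y))| / n.
n = X.shape[0]
alpha_max = np.max(np.abs(X.T @ (y - y.mean()))) / n

too_strong = Lasso(alpha=alpha_max * 1.01).fit(X, y)
weaker = Lasso(alpha=alpha_max * 0.1).fit(X, y)

print("alpha_max:", alpha_max)
print("coefficients just above alpha_max:", too_strong.coef_)
print("coefficients at 0.1 * alpha_max:", weaker.coef_)
```

With standardized features and a 0/1 target, this threshold cannot exceed 0.5, so an alpha of 0.3 can easily sit above it. Running the same calculation on the real X and Y would show where the cutoff lies for that data.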

GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np

# Define the grid for the alpha parameter
parameters = {'alpha':[0.01, 0.001, 0.0005]}

# Fit it on X, Y and define the cv parameter for cross-validation
clf = GridSearchCV(lasso, parameters, cv = 3)
clf.fit(X, Y)

# Get the best parameters and model
print(clf.best_estimator_)

Note: To define a specific parameter space use: parameters = {'alpha': np.arange(0.001,1,0.02)}


EDIT 1: After taking into account the last paragraph that you just added in your question, use this:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()

data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X     = scaler.fit_transform(dataX)
dataY = data['outcome']
Y = dataY.values  # 1-D label array, as classifiers expect (avoids a shape warning)

X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)

# Logistic Regression (aka logit, MaxEnt) classifier.
lr = LogisticRegression()
lr.fit(X_train,y_train)

# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])

# The probability that the first test sample belongs to class 0 is 0.8704, and to class 1 is 0.1295
[[0.87046267 0.12953733]
 [0.87797594 0.12202406]
 [0.80046704 0.19953296]]

0
votes

The data in Y looks like class labels: every value is either 0 or 1. So you should use a classification algorithm and then use its coefficients to get the probability.

Most scikit-learn classifiers have a predict_proba() method, which you can use to get the probability directly.
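As a rough illustration of how the two routes relate, the sketch below fits a LogisticRegression with default settings and checks that applying the logistic (sigmoid) function to the linear score built from coef_ and intercept_ gives the same class-1 probability as predict_proba(). The data is synthetic (make_classification is an assumption standing in for data.csv), and the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv (assumption: 9 features, binary outcome)
X_raw, y = make_classification(n_samples=500, n_features=9,
                               n_informative=3, random_state=33)
X = StandardScaler().fit_transform(X_raw)

lr = LogisticRegression().fit(X, y)

# Route 1: linear score from the fitted coefficients, then the sigmoid
scores = X @ lr.coef_.ravel() + lr.intercept_[0]
manual_probs = 1.0 / (1.0 + np.exp(-scores))

# Route 2: ask the classifier directly; column 1 is the probability of class 1
direct_probs = lr.predict_proba(X)[:, 1]

print("first three manual probabilities:", manual_probs[:3])
print("first three predict_proba values:", direct_probs[:3])
print("routes agree:", np.allclose(manual_probs, direct_probs))
```

In practice predict_proba() is the simpler choice; the manual route is mainly useful for understanding what the coefficients mean.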

If you absolutely need to use a regression model, you can try LinearRegression, which uses ordinary least squares, or LassoCV, which tunes alpha automatically by cross-validation.
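A minimal sketch of those two regression options, again on synthetic stand-in data (make_classification, the 9-feature shape, and cv=3 are assumptions chosen for illustration, not taken from the original data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data.csv (assumption: 9 features, binary outcome)
X_raw, y = make_classification(n_samples=500, n_features=9,
                               n_informative=3, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y.astype(float), test_size=0.25, random_state=33)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Ordinary least squares: no penalty, so nothing is forced to zero
ols = LinearRegression().fit(X_train, y_train)

# LassoCV picks alpha from its own grid using 3-fold cross-validation
lasso_cv = LassoCV(cv=3).fit(X_train, y_train)

print("OLS R^2 on test:", ols.score(X_test, y_test))
print("LassoCV chosen alpha:", lasso_cv.alpha_)
print("LassoCV R^2 on test:", lasso_cv.score(X_test, y_test))
print("LassoCV coefficients:", lasso_cv.coef_)
```

Keep in mind that predictions from either regression model are not constrained to lie between 0 and 1, so they are not probabilities in the strict sense.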