Logistic regression sklearn - train and apply model

Question

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model. Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:

import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics


# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)

# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')

# Scores
df_data = df.ix[:,:-1].values

# Target
df_target = df.ix[:,-1].values

# Values to predict
df_test = df_pred.ix[:,:-1].values

# Scores' names
df_data_names = cols.values

# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target

# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

# Logistic regression normalizing variables
LogReg = LogisticRegression()

# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores

# Predict new
novel = LogReg.predict(X_pred)

Is this the correct way to implement a Logistic Regression? I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.

Andrey Lukyanenko Andrey Lukyanenko · Accepted Answer · 2017-11-18T04:05:13

I general things are okay, but there are some problems.

Scaling

X, X_pred, y = scale(df_data), scale(df_test), df_target

You scale training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. "Scale" is a simple function, but it is better to use something else, for example StandardScaler.

scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)

Cross-validation and predicting. How your code works? You split data 10 times into train and hold-out set; 10 times fit model on train set and calculate score on hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of data. So it would be better to fit model on the whole dataset and then make a prediction:
```
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
```

I want to notice that there are advanced technics like stacking and boosting, but if you learn using sklearn, then it is better to stick to the basics.

Logistic regression sklearn - train and apply model

1 Answers