0
votes

I'm new to machine learning and trying Sklearn for the first time. I have two dataframes, one with data to train a logistic regression model (with 10-fold cross-validation) and another one to predict classes ('0,1') using that model. Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:

import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics


# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)

# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')

# Scores
df_data = df.ix[:,:-1].values

# Target
df_target = df.ix[:,-1].values

# Values to predict
df_test = df_pred.ix[:,:-1].values

# Scores' names
df_data_names = cols.values

# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target

# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

# Logistic regression normalizing variables
LogReg = LogisticRegression()

# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print scores

# Predict new
novel = LogReg.predict(X_pred)

Is this the correct way to implement a Logistic Regression? I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.

1
post some data. print out df and df_dataskrubber

1 Answers

1
votes

I general things are okay, but there are some problems.

  1. Scaling

    X, X_pred, y = scale(df_data), scale(df_test), df_target
    

You scale training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. "Scale" is a simple function, but it is better to use something else, for example StandardScaler.

scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
  1. Cross-validation and predicting. How your code works? You split data 10 times into train and hold-out set; 10 times fit model on train set and calculate score on hold-out set. This way you get cross-validation scores, but the model is fitted only on a part of data. So it would be better to fit model on the whole dataset and then make a prediction:

    LogReg.fit(X, y)
    novel = LogReg.predict(X_pred)
    

I want to notice that there are advanced technics like stacking and boosting, but if you learn using sklearn, then it is better to stick to the basics.