I am trying to solve a classification problem on a given dataset using logistic regression (that part is not the problem). To avoid overfitting, I want to evaluate the model with cross-validation (and here is the problem): there is something I'm missing to complete the program. My goal is to measure accuracy.
But let me be specific. This is what I've done:
- I split the set into train set and test set
- I defined the logistic regression model to be used
- I used the cross_val_predict method (from sklearn.cross_validation) to make predictions
- Lastly, I measured accuracy
Here is the code:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression
# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# the last column is the target; store it in t
t = data['TARGET']
# list of features, including target
features = data.columns
# feature matrix in X (all columns except the target)
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
# define method
logreg=LogisticRegression()
# cross-validation prediction on the training set
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))
My problems:
From what I understand, the test set should not be touched until the very end, and cross-validation should be performed on the training set only. That's why I passed X_train and t_train to cross_val_predict. However, I get an error saying:
ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]
where 6016 is the number of samples in the whole dataset and 4812 is the number of samples in the training set after the split.
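For reference, this is just a sanity check on the arrays already defined above, to show where the two numbers come from:
# shapes of the arrays involved
print(X.shape, t.shape)              # whole dataset: 6016 samples
print(X_train.shape, t_train.shape)  # training set: 80% of 6016 = 4812 samples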
Beyond this error, I don't know how to proceed. I mean: when do X_test and t_test come into play? I don't understand how I should use them after cross-validating, or how to get the final accuracy.
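For what it's worth, this is roughly what I imagine the final step to look like (re-fitting on the full training set and scoring once on the held-out test set), but I'm not sure it's correct:
# my guess: after cross-validation, fit on the whole training set
# and use the test set only once, for the final accuracy
logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)
print(metrics.accuracy_score(t_test, t_pred))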
Bonus question: I'd also like to perform scaling and dimensionality reduction (through feature selection or PCA) within each fold of the cross-validation. How can I do this? I've seen that defining a pipeline can help with scaling, but I don't know how to apply it to the second problem.
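For the scaling part, this is the kind of pipeline I've seen suggested (the n_components value is just a placeholder I made up); I don't know whether this is the right way to also get PCA or feature selection refit inside each fold:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# pipeline: scaling and PCA should be re-fit within each CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),  # placeholder value
    ('logreg', LogisticRegression())
])
predicted = cross_validation.cross_val_predict(pipe, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))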
I'd really appreciate any help :-)