Splitting test/training data for scikit?

0 votes

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?

The starter code looks like so:

train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])

##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)

I need to change the acc_svc variable to be using X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I'm having some size mismatching going on when I attempt to do so. Should be a simple question, anyone mind pointing me in the right direction?

Tags: python, machine-learning, scikit-learn

0 votes

test_preprocessed.csv shouldn't be used to check your model's performance. Use train_test_split() from scikit-learn to split train_df into training and validation sets, then evaluate the model on the validation set, i.e. against the y values of the validation rows. Please refer to the scikit-learn documentation for train_test_split().
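
A minimal sketch of that split, using a small synthetic DataFrame as a stand-in for train_preprocessed.csv. The "Age" and "Fare" columns, the 25% validation size, and the random_state value are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for train_preprocessed.csv (hypothetical columns).
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "Age": rng.integers(1, 80, size=200),
    "Fare": rng.uniform(5, 100, size=200),
    "Survived": rng.integers(0, 2, size=200),
})

X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]

# Hold out 25% of the labelled rows as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

svc = SVC()
svc.fit(X_train, y_train)

# Score on rows the model has not seen, using their known labels.
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("validation accuracy is:", acc_val)
```

The validation labels y_val play the role of the "Y_test" asked about in the question; the rows from test_preprocessed.csv are then only used for the final predictions.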

0 votes

First of all, you have to be clear about your target variable. Your "Y_pred" holds the model's predictions of the "Survived" label for the test set; a "Y_test" would be the true labels for those same rows, which you need in order to compute an accuracy. Also note that your feature sets are built differently: you drop "Survived" from "X_train" so that you can use it as the target, but for "X_test" you drop "PassengerId" instead.
Another basic point is that your dataset is already split into train and test subsets (your two CSV files). I assume that your test set has one column fewer than the train set, and that the missing column is "Survived". If the test CSV does contain "Survived", drop it from "X_test" to avoid the mismatch and keep it as your test target, i.e. your "Y_test". If it does not, you cannot build a "Y_test" from that file: "Y_pred = svc.predict(X_test)" gives you only the predicted labels, with nothing to score them against.
One possible reason for the size mismatch is that the number of columns (axis=1) in your train set does not match that of the test set.
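
A quick way to check for such a column mismatch, sketched here with two tiny hypothetical frames in place of the real "X_train" and "X_test" (the column names and values are made up):

```python
import pandas as pd

# Hypothetical feature frames standing in for X_train and X_test.
X_train = pd.DataFrame({"PassengerId": [1, 2], "Age": [22, 38], "Fare": [7.25, 71.28]})
X_test = pd.DataFrame({"Age": [34, 47], "Fare": [7.83, 7.0]})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(X_train.columns) - set(X_test.columns))
only_in_test = sorted(set(X_test.columns) - set(X_train.columns))

print("shapes:", X_train.shape, X_test.shape)
print("only in train:", only_in_train)
print("only in test:", only_in_test)
```

Both lists should be empty before the frames are passed to the same fitted model.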
If you want to do the train/test split yourself with scikit-learn, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of which rows came from which file, and so recover the original train and test parts later, is to label them at merge time with the "keys" parameter of pandas.concat.
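
As a rough illustration of the "keys" idea, here is a sketch with two tiny hypothetical frames in place of the CSV files (the column values are made up):

```python
import pandas as pd

# Tiny hypothetical frames standing in for the two CSV files.
train_df = pd.DataFrame({"Age": [22, 38, 26], "Survived": [0, 1, 1]})
test_df = pd.DataFrame({"Age": [34, 47]})

# keys=[0, 1] adds an outer index level: 0 marks train rows, 1 marks test rows.
merged = pd.concat([train_df, test_df], keys=[0, 1])

# ... shared preprocessing on `merged` would go here ...

# Recover the original parts by selecting on the outer index level.
train_part = merged.xs(0)
test_part = merged.xs(1)

print(train_part.shape, test_part.shape)
```

Note that a column missing from one frame ("Survived" in this hypothetical test frame) is filled with NaN in the merged data, so it has to be dropped from the test part before predicting.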

Incorporating the above, one simple solution might be:

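# imports needed by the code below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# read the CSV files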
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')

# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])

# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])

# now you can extract the train and test parts, using the keys from the train-test merge
train_part = merged_data.xs(0)
test_part = merged_data.xs(1)  # rows from test_preprocessed.csv

# setting up predictors - target
X = train_part.loc[:, train_part.columns != "Survived"]
y = train_part.loc[:, "Survived"]

# train-test split
# If test_size is None (and train_size is also None), it defaults to 0.25 based on the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
# score on the held-out split rather than on the data the model was trained on
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("svm accuracy is:", acc_svc)

In my opinion, once you understand the above, you could further estimate and compare your model's performance using the cross_val_score function, along the lines @SunilG mentions. For example, for 3-fold cross-validation (cv=3), you could run:

from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')

If you do not want to go that way and would rather stay close to your starter code, then you should delete your 5th line of code, and I suppose it would run (provided your test set does not include your target variable; otherwise drop it). However, in this case you would not be splitting into train and test on your own, since the data is already split, so the title of your question should be changed.
