0 votes

I'm using the training data set (i.e., X_train, y_train) when tuning the hyperparameters of my model. I need to use the test data set (i.e., X_test, y_test) as a final check, to make sure my model isn't biased. I wrote:

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

folds = 4

# hold out 1/folds of the data as the test set, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=38, stratify=y)

clf_logreg = Pipeline(steps=[('preprocessor', preprocessing),
                             ('model', LogisticRegression(solver='lbfgs', max_iter=100))])

# cross-validate on the training set only (3 folds here)
cv = KFold(n_splits=(folds - 1))
scores_logreg = cross_val_score(clf_logreg, X_train, y_train, cv=cv)

and, to get f1-score,

from sklearn.metrics import f1_score, make_scorer

cross_val_score(clf_logreg, X_train, y_train, scoring=make_scorer(f1_score, average='weighted'), cv=cv)

This returns

scores_logreg: [0.94422311, 0.99335548, 0.97209302] and for f1: [0.97201365, 0.9926906 , 0.98925453]
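
For reference, a common way to summarize such per-fold scores is their mean and standard deviation, for example:

import numpy as np

# summarize the cross-validation scores as mean +/- standard deviation
print(f"{np.mean(scores_logreg):.3f} +/- {np.std(scores_logreg):.3f}")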

For checking on the test set, is it right to write

cross_val_score(clf_logreg, X_test, y_test, scoring=make_scorer(f1_score, average='weighted'), cv=cv)  # not sure if it is OK to keep cv here

or maybe

predicted_logreg = clf_logreg.predict(X_test)
f1 = f1_score(y_test, predicted_logreg)

The values returned are different.

Comment (Vibhav Surve): have you tried using classification_report in sklearn.metrics?

1 Answer

1 vote

cross_val_score is meant for scoring a model by cross-validation. If you do:

cross_val_score(clf_logreg, X_test, y_test,
                scoring=make_scorer(f1_score, average='weighted'), cv=cv)

you are redoing the cross-validation on your test set, which does not make much sense: all it achieves is training your model on a smaller dataset (folds of the test set) instead of evaluating the pipeline you already validated on the training data.
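
To see why, here is a simplified sketch of roughly what cross_val_score does internally (assuming X_test and y_test are pandas objects, hence the .iloc indexing): it clones the unfitted pipeline and refits it on each fold, so the scores come from models trained on only part of your test set.

from sklearn.base import clone

# simplified sketch of what cross_val_score does with (X_test, y_test):
# each fold gets a fresh clone of the (unfitted) pipeline, refit on part of the test set
for train_idx, val_idx in cv.split(X_test):
    fold_model = clone(clf_logreg)
    fold_model.fit(X_test.iloc[train_idx], y_test.iloc[train_idx])
    print(fold_model.score(X_test.iloc[val_idx], y_test.iloc[val_idx]))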

I think the scikit-learn help page on cross-validation illustrates it: you don't need to rerun a cross-validation on your test set:

[cross-validation workflow diagram from the scikit-learn documentation]

You just do:

predicted_logreg = clf_logreg.predict(X_test)
f1 = f1_score(y_test, predicted_logreg, average='weighted')  # same averaging as the CV scorer
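
Note that cross_val_score only fits internal clones of the pipeline, so clf_logreg itself still has to be trained on the training data before that predict call. A minimal sketch, also using classification_report as suggested in the comment above:

from sklearn.metrics import classification_report

clf_logreg.fit(X_train, y_train)               # train the pipeline on the full training set first
predicted_logreg = clf_logreg.predict(X_test)  # single pass over the held-out test set
f1 = f1_score(y_test, predicted_logreg, average='weighted')

print(classification_report(y_test, predicted_logreg))  # per-class precision/recall/F1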