3
votes

Can someone please let me know if this is the correct way to calculate the cross-validated precision of my classifier? I divided my dataset into xtrain and ytrain for the training data, and xtest and ytest for the test set.

Building the model:

RFC = RandomForestClassifier(n_estimators=100)

Fitting it to training set:

RFC.fit(xtrain, ytrain)

This is the part I am unsure about:

scores = cross_val_score(RFC, xtest, ytest, cv = 10, scoring='precision')

Using the code above, would "scores" give me the precision of my model, which was trained on the training data? I am afraid that I used the wrong code and that I am fitting the model to xtest, because the recall and precision scores for my test data are HIGHER than the scores for my training data, and I couldn't figure out why!


1 Answer

2
votes

You don't actually have to do the fitting of the model yourself when you compute the cross-validation score.

The correct (and simpler) way to compute the cross-validated score is to create the model just as you do:

RFC = RandomForestClassifier(n_estimators=100)

Then just compute the score:

scores = cross_val_score(RFC, xtrain, ytrain, cv = 10, scoring='precision')

Usually in machine learning / statistics, you split your data into a training and a test set (as you did). The training data is then used to validate the model (tuning parameters, cross-validation, etc.), and the final model is tested on the test set. Thus, you won't actually use your test set in the cross-validation, only in the final phase when you want the final accuracy of the model.

Separating the data into training and test sets and doing the cross-validation on the training data has the advantage that you won't be overfitting the model parameters (with cross-validation), since the separate test set is only used in the final phase.
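Here is a minimal sketch of that workflow end to end, using a synthetic dataset from make_classification purely for illustration (your own xtrain/xtest arrays would replace it): cross-validate on the training data only, then fit once and compute precision on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for your real dataset
X, y = make_classification(n_samples=500, random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=0
)

RFC = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validated precision, computed ONLY on the training data.
# cross_val_score fits a fresh clone of RFC on each fold internally,
# so no prior call to RFC.fit() is needed.
scores = cross_val_score(RFC, xtrain, ytrain, cv=10, scoring='precision')
print("Mean CV precision:", scores.mean())

# Final phase: fit on the full training set, score once on the test set
RFC.fit(xtrain, ytrain)
test_precision = precision_score(ytest, RFC.predict(xtest))
print("Test precision:", test_precision)
```

Note that the estimator passed to cross_val_score stays unfitted; the explicit RFC.fit() call only happens afterwards, for the final test-set evaluation.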

You can learn more here: cross_val_score and Cross-Validation