1 vote

I am doing a classification task using libsvm. With 10-fold cross-validation the F1 score is 0.80. However, when I split the training dataset into two parts (one for training and the other for testing, which I call the holdout test set), the F1 score drops to 0.65. The split is in an 80/20 ratio.

So, my question is: is there any significant difference between doing k-fold cross-validation and a holdout test? Which of the two techniques will produce a model that generalizes well? In both cases my dataset is scaled.
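
To make the setup concrete, here is a minimal sketch of the two evaluations I described, using scikit-learn's SVC (which wraps libsvm); the synthetic data and SVM parameters are placeholders for my actual scaled dataset:

    # Sketch: 10-fold CV F1 vs. a single 80/20 holdout F1.
    # X and y below are synthetic stand-ins for the real dataset.
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import f1_score
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Scaling lives inside the pipeline so it is re-fit on each training fold.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

    # 10-fold cross-validated F1.
    cv_f1 = cross_val_score(model, X, y, cv=10, scoring="f1")
    print("10-fold CV F1: %.3f +/- %.3f" % (cv_f1.mean(), cv_f1.std()))

    # Single 80/20 holdout split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model.fit(X_tr, y_tr)
    print("Holdout F1:    %.3f" % f1_score(y_te, model.predict(X_te)))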


2 Answers

5 votes

There are huge differences, but an exact analysis requires a fair amount of statistics. For a deeper understanding, refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani and Friedman.

In short:

  • A single train-test split is an unreliable measure of model quality (unless you have a very large dataset).
  • Repeated train-test splits converge to the true score, provided the training set is representative of the underlying distribution; in practice, however, they are often overoptimistic.
  • CV tends to give lower (less optimistic) scores of model quality than train-test splits and converges to a reasonable answer much faster, though at the cost of higher computational complexity.
  • If you have a large dataset (>50,000 samples), a train-test split might be enough.
  • If you have enough time, CV is nearly always a better (less optimistic) way to measure classifier quality.
  • There are more methods than just these two; you might also want to look at methods from the err0.632 (bootstrap) family; a minimal sketch is given after this list.
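
As a rough sketch of the err0.632 idea from the last bullet (the synthetic data, SVC parameters and number of bootstrap rounds are placeholder assumptions, not part of the question): train on bootstrap samples, score on the out-of-bag rows, and blend that with the resubstitution score using Efron's 0.368/0.632 weights.

    # Sketch of a .632 bootstrap estimate of the F1 score.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import f1_score
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    model = make_pipeline(StandardScaler(), SVC())
    rng = np.random.RandomState(0)

    n, B = len(y), 50
    oob_scores = []
    for _ in range(B):
        # Draw a bootstrap sample (with replacement), score on out-of-bag rows.
        boot = rng.randint(0, n, n)
        oob = np.setdiff1d(np.arange(n), boot)
        model.fit(X[boot], y[boot])
        oob_scores.append(f1_score(y[oob], model.predict(X[oob])))

    # Resubstitution (training) score on the full data.
    model.fit(X, y)
    resub = f1_score(y, model.predict(X))

    f1_632 = 0.368 * resub + 0.632 * np.mean(oob_scores)
    print(".632 bootstrap F1 estimate: %.3f" % f1_632)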
0 votes

The difference comes from using a single split: if you split the data into train/test a different way (perhaps after shuffling), you will get a different value. Therefore, creating several such splits and averaging all the F1 scores gives a result comparable to CV, and CV generalizes better. A minimal sketch of this is shown below.
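
Here is that idea as a sketch, assuming scikit-learn's libsvm-backed SVC and synthetic placeholder data; a single split, the average over 20 reshuffled splits, and 10-fold CV are computed side by side:

    # Sketch: one split vs. averaged reshuffled splits vs. 10-fold CV.
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = make_pipeline(StandardScaler(), SVC())

    single = cross_val_score(model, X, y, scoring="f1",
                             cv=ShuffleSplit(n_splits=1, test_size=0.2, random_state=0))
    repeated = cross_val_score(model, X, y, scoring="f1",
                               cv=ShuffleSplit(n_splits=20, test_size=0.2, random_state=0))
    cv10 = cross_val_score(model, X, y, cv=10, scoring="f1")

    print("single 80/20 split F1: %.3f" % single[0])
    print("20 shuffled splits F1: %.3f +/- %.3f" % (repeated.mean(), repeated.std()))
    print("10-fold CV F1:         %.3f +/- %.3f" % (cv10.mean(), cv10.std()))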