
I am having some trouble understanding how to implement cross-validation. In my case I am trying to apply it to an LVQ system. This is what I have understood so far...

One of the parameters that can be adjusted for LVQ is the number of prototypes used to model each class. In order to find the best number of prototypes, one must train the model on training data, then test it on unseen data and measure its performance. However, depending on which data points you use for training and which for validation, the performance will vary. Hence cross-validation can be used to average the performance over several splits.

You repeat this for different numbers of prototypes and see which number obtains the best average. Once this is done, what do you do next? Do you train a new model on the entire training set using the number of prototypes that obtained the best result, or do you use the model from the fold that obtained the highest accuracy during cross-validation?
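The selection loop described above can be sketched in Python. Since there is no standard LVQ implementation at hand, the `fit_prototypes` "training" step below is a crude stand-in (it just samples prototypes per class); the k-fold averaging and the scan over prototype counts are the part being illustrated:

```python
import numpy as np

def fit_prototypes(X, y, n_per_class, rng):
    """Crude stand-in for LVQ training: pick n_per_class random
    training points per class as that class's prototypes."""
    protos, labels = [], []
    for c in np.unique(y):
        idx = rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        protos.append(X[idx])
        labels.append(y[idx])
    return np.vstack(protos), np.concatenate(labels)

def predict(protos, proto_labels, X):
    # Nearest-prototype classification.
    dists = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return proto_labels[dists.argmin(axis=1)]

def cv_accuracy(X, y, n_per_class, k=5, seed=0):
    """Average accuracy of an n_per_class model over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        te = folds[i]  # held-out fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        protos, pl = fit_prototypes(X[tr], y[tr], n_per_class, rng)
        accs.append(np.mean(predict(protos, pl, X[te]) == y[te]))
    return float(np.mean(accs))

# Toy two-class data; try several prototype counts, keep the best average.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.repeat([0, 1], 50)
scores = {n: cv_accuracy(X, y, n) for n in (1, 2, 5)}
best_n = max(scores, key=scores.get)
```

Note that `best_n` is the only thing carried forward; the per-fold models exist solely to produce the averaged scores.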

I'm voting to close this question as off-topic because it belongs to stats.stackexchange.com - Lior Kogan
Hi. Thanks for pointing that out. I will make sure to ask a question on the right site next time. - ganninu93

1 Answer


Do you generate a new model on the entire training set corresponding to the amount of prototypes which obtained the best result, or do you use the model corresponding to the fold which obtained the highest accuracy during cross validation?

Once the CV is done and you have obtained the best parameters (in your case, the number of prototypes per class), you fix them and train a single model on the entire training dataset.

The rationale is as follows. Say your training dataset is tr, and you want to estimate the model's performance on some other dataset te (where te is either a validation dataset or the "real world"). Since you cannot tune parameters against te directly (either because doing so would overfit, or because te is the "real world" and is not available), you emulate it within tr by repeatedly splitting tr into tr_cv and te_cv. Once you have obtained the best parameters, though, there is no reason not to use the entire training data to build the final model.
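The select-then-refit pattern described above can be sketched end to end. As before, actual LVQ training is replaced by a stand-in `fit` that samples prototypes per class; the point illustrated is that the fold models are discarded and the chosen hyperparameter is refit once on all of tr:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the full training set tr (two Gaussian classes).
X_tr = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
y_tr = np.repeat([0, 1], 60)

def fit(X, y, n_per_class):
    # Stand-in "training": sample n_per_class prototypes per class.
    ps, ls = [], []
    for c in np.unique(y):
        idx = rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
        ps.append(X[idx])
        ls.append(y[idx])
    return np.vstack(ps), np.concatenate(ls)

def acc(protos, plabels, X, y):
    # Accuracy of nearest-prototype classification.
    d = np.linalg.norm(X[:, None] - protos[None], axis=2)
    return float(np.mean(plabels[d.argmin(axis=1)] == y))

# 1. Emulate te by repeatedly splitting tr into tr_cv / te_cv.
k = 5
folds = np.array_split(rng.permutation(len(X_tr)), k)

def cv_score(n):
    total = 0.0
    for i in range(k):
        te_cv = folds[i]
        tr_cv = np.concatenate([folds[j] for j in range(k) if j != i])
        total += acc(*fit(X_tr[tr_cv], y_tr[tr_cv], n),
                     X_tr[te_cv], y_tr[te_cv])
    return total / k

# 2. Fix the hyperparameter that scored best on average...
best_n = max((1, 3, 5), key=cv_score)

# 3. ...then discard the fold models and train once on ALL of tr.
final_protos, final_labels = fit(X_tr, y_tr, best_n)
```

The final model in step 3 is the one you would evaluate on te or deploy; none of the per-fold models from step 1 survive.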