3
votes

I want to run a random forest in parallel using the caret package, and I want to set the seeds so the results are reproducible, as in "Fully reproducible parallel models using caret". However, I don't understand line 9 in the following code, taken from the caret help: why do we sample 22 integers per resample (plus one more for the last model in line 12, making 23) when only 12 values of the parameter k are evaluated? For information, I want to run 5-fold CV to evaluate 584 values of the RF parameter 'mtry'. Any help is much appreciated. Thank you.

## Not run:

## Do 5 repeats of 10-Fold CV for the iris data. We will fit
## a KNN model that evaluates 12 values of k and set the seed
## at each iteration.

set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22) # Why 22?

## For the last model:
seeds[[51]] <- sample.int(1000, 1)

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     seeds = seeds)
1
Yes, it looks like a mistake; you would expect to sample it 50 times. Even if they wanted to reuse the seeds between folds, you would expect a multiple of 10. I'd say it's a mistake. - smci

1 Answer

2
votes

I'd say it is a mistake, and it should be 12 instead of 22.

From what I understand, 10 folds times 5 repeats gives 10*5 = 50 resamples, and the model is fit once per value of k in each resample. Hence, for each i in 1:50, you need 12 seeds (one for every k). After the best k is found, the final model is fit once on the full training set, so this last step needs only one seed (there is no more resampling).
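
Applying the same counting to the setup in the question (5-fold CV, 584 values of mtry) would give something like the sketch below. It assumes a single repeat, so there are 5 resamples, and it assumes each of the 584 mtry values is fit as a separate model in every resample. The names n_resamples and n_models are only illustrative, and the upper bound of 10000 for the seed values is an arbitrary choice. The trainControl() call is guarded so the snippet still runs where caret is not installed.

```r
## Assumed setup from the question: 5-fold CV, 584 candidate mtry values.
n_resamples <- 5    # folds x repeats (5 folds, 1 repeat)
n_models    <- 584  # one model per mtry value in each resample

set.seed(123)

## One list element per resample, plus one for the final model.
seeds <- vector(mode = "list", length = n_resamples + 1)

## Each resample needs one seed per candidate mtry value.
for (i in seq_len(n_resamples)) {
  seeds[[i]] <- sample.int(10000, n_models)
}

## For the last model (fit once with the chosen mtry):
seeds[[n_resamples + 1]] <- sample.int(10000, 1)

## Pass the seeds to caret if it is available.
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::trainControl(method = "cv",
                              number = n_resamples,
                              seeds = seeds)
}
```

With repeated CV, n_resamples would instead be folds times repeats, exactly as in the 10*5 = 50 case above.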