
I have implemented k-fold cross-validation without packages, based on How to split a data set to do 10-fold cross validation using no packages.

I need to select 30% of the sample from each fold's training data. Here is my code:

samples = 300
r = 0.83

library('MASS')

df = data.frame(mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE))
w = df[sample(nrow(df)),]
w = data.frame(w)
kcv = 10
folds <- cut(seq(from = 1,to = nrow(w)),breaks=kcv,labels=FALSE)
kfolddata<-cbind(w,folds)


for (i in 1:kcv) {
  testIndexes <- which(kfolddata[,ncol(kfolddata)]==i,arr.ind=TRUE)
  testData <- w[testIndexes, ]
  trainData <- w[-testIndexes, ]
  trainIndexes <- kfolddata[-testIndexes,]

  if (i == 1) {
    set.seed(1234)
    SubInd = sample(nrow(trainData), size = round(0.3 * nrow(trainData)),
                    replace = FALSE)
  } else {
    SubInd = rbind(SubInd, sample(nrow(trainData), size = round(0.3 * nrow(trainData)),
                                  replace = FALSE))
  }
}

The result only contains the IDs (row positions) of the selected subset. How can I obtain the data (the variables) for the IDs selected in SubInd?

Is using rbind the correct way? I need to do another loop over SubInd afterwards.

Instead of sampling from nrow(trainData), sample from trainIndexes and then use w[SubInd, ] at the end. If you had a reproducible example, it would be easier to give a better answer. - Suren
@Suren I do have trainIndexes. Let me edit my post. - Norin

2 Answers


If your only goal is to randomly sample 30% of the training data for each fold, you can use lapply() instead of a for-loop, combined with dplyr's filter() and sample_frac(). With 1000 original cases, each fold's training data has 900 cases, so 270 rows are returned when sampling 30%.

# create df
library(dplyr)  # for filter(), sample_frac(), count()
df <- data.frame(x=runif(1000))

#Randomly shuffle the data
df <- df[sample(nrow(df)),]; df <- data.frame(x=df)

#Create 10 equally size folds
folds <- cut(seq(1,nrow(df)),breaks=10,labels=FALSE)
df$folds <- folds

df1 <- lapply(1:10, function(i) {
  df %>% filter(folds != i) %>% sample_frac(.3)
})

lapply(df1,dim)

d <- df1[[1]]; d %>% count(folds) # check no test data, fold==1
d <- df1[[2]]; d %>% count(folds) # check no test data, fold==2
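The same idea can be sketched in base R as a sanity check, so it runs without dplyr; the set.seed() call and the checks at the end are my own additions, not part of the answer above:

```r
set.seed(1)  # reproducibility; not part of the original answer

df <- data.frame(x = runif(1000))
df$folds <- cut(seq_len(nrow(df)), breaks = 10, labels = FALSE)

# base-R equivalent of filter(folds != i) %>% sample_frac(.3)
df1 <- lapply(1:10, function(i) {
  train <- df[df$folds != i, ]
  train[sample(nrow(train), size = round(0.3 * nrow(train))), ]
})

stopifnot(all(sapply(df1, nrow) == 270))                           # 0.3 * 900
stopifnot(all(sapply(1:10, function(i) all(df1[[i]]$folds != i)))) # no test rows
```

The two stopifnot() calls confirm that each element holds 270 rows and that no element contains rows from its own test fold.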

Since you want a single flat vector of row indices (and rbind() would coerce the per-fold samples into a matrix), using c() to combine the resulting vectors is easier.

I don't know exactly what you are after; I am guessing it is either all_0.3train or unique_0.3_train.

for (i in 1 : kcv) { 

  trainIndexes <- which(kfolddata[, ncol(kfolddata)] !=i, arr.ind=TRUE)
  testData <- w[-trainIndexes, ]
  trainData <- w[trainIndexes, ]

  if (i == 1) {

    set.seed(1234)
    SubInd = sample(trainIndexes , size = round(0.3 * 
                                                nrow(trainData)), replace=F)
  } else {

    SubInd = c(SubInd, sample(trainIndexes , size = round(0.3 *
                                                  nrow(trainData)), replace=F))
  }
}

all_0.3train <- w[SubInd, ]
unique_0.3_train <- w[unique(SubInd), ]
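If you need the sampled rows separately for each fold (e.g. for the further loop mentioned in the question), keeping one index vector per fold in a list may be more convenient than one pooled vector. This is a sketch under the same setup as the question; the names SubIndList and trainSub1 are hypothetical, my own:

```r
library(MASS)  # for mvrnorm()

set.seed(1234)
samples <- 300; r <- 0.83; kcv <- 10
w <- data.frame(mvrnorm(n = samples, mu = c(0, 0),
                        Sigma = matrix(c(1, r, r, 1), nrow = 2), empirical = TRUE))
folds <- cut(seq_len(nrow(w)), breaks = kcv, labels = FALSE)

# one vector of sampled training-row indices per fold, instead of one pooled vector
SubIndList <- lapply(1:kcv, function(i) {
  trainIndexes <- which(folds != i)
  sample(trainIndexes, size = round(0.3 * length(trainIndexes)))
})

# the actual data (the variables) for fold 1's subsample:
trainSub1 <- w[SubIndList[[1]], ]
stopifnot(nrow(trainSub1) == 81)  # round(0.3 * 270)
```

Indexing w with each element of SubIndList then gives the variables for that fold's subsample directly, with no need to disentangle a pooled vector.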