1
votes

I have a relatively large data set:

more than 370,000 observations, a categorical dependent variable with 250 levels, and 10 independent variables, both numeric and categorical.

I want to perform 10-fold cross-validation to compare models (a classification tree from 'rpart', an SVM from package 'e1071', kknn from package 'kknn', and boosting and bagging from package 'adabag').

After reading the manuals for these models, I tried to write the code to fit them, but I really do not know how to perform 10-fold CV.

I have tried, but I am new to R. I really need help with code or functions for 10-fold CV.

Here is my code:

w <- read.csv('D:/R code/animal2.csv',header = T)
names(w)
[1] "cluster_ward" "AAT0"         "ARIDITY"      "TOPO"         "TMAX"        
[6] "PREMAX"       "PREMIN"       "AMT"          "SU_CODE90"    "T_OC"        
[11] "ELEMAX"  

nrow(w)
[1] 370827  

w$TOPO <- as.factor(w$TOPO)
w$SU_CODE90 <- as.factor(w$SU_CODE90)  

library(rpart.plot)  
fit1 <- rpart(cluster_ward ~., w)
pred1 <- predict(fit1, w, type="class")  

library(e1071)
fit2 <- svm(cluster_ward ~ ., data = w, kernel = "sigmoid")
pred2 <- predict(fit2, w)

library(kknn)
set.seed(1000)
fit3 <- kknn(cluster_ward~., train=w, test=w)
pred3 <- fit3$fit

library(adabag)
set.seed(1000)
fit4 <- boosting(cluster_ward~., w)
pred4 <- predict(fit4,w)$class

library(adabag)
set.seed(1000)
fit5 <- bagging(cluster_ward~., w)
pred5 <- predict(fit5,w)$class

Someone told me that the packages 'cvTools' or 'caret' can perform k-fold CV, but I still cannot get it to work with these packages or functions.


2 Answers

2
votes

I usually prefer to implement CV myself, as it is relatively easy and gives you control over the algorithms you use and the evaluation metric.

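k <- 10 # Number of folds
# Randomly assign each row to a fold (fold sizes are only approximately equal)
id <- sample(1:k, nrow(data), replace = TRUE)
rmse <- numeric(k) # One RMSE value per fold
for (i in 1:k) {
  trainingset <- data[id != i, ]
  testset <- data[id == i, ]

  # Training (glm stands in for any model; as written it assumes a numeric response)
  fit.glm <- glm(cluster_ward ~ ., data = trainingset)

  # Testing
  pred <- predict(fit.glm, testset, type = "response")
  real <- testset$cluster_ward
  rmse[i] <- sqrt(mean((pred - real)^2))
}
mean(rmse) # Average error over the k folds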
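For a categorical response such as cluster_ward, RMSE is not a meaningful metric, so the same loop would more likely use a classifier and a misclassification rate. The following is only a sketch under stated assumptions: the built-in iris data stands in for the asker's file, rpart is used as the example classifier, and the helper name cv_error is hypothetical.

```r
library(rpart)  # classification trees; a recommended package in most R installations

# Hypothetical helper: k-fold CV misclassification rate for a classification tree.
# `response` is the name of a factor column in `df` (an assumption of this sketch).
cv_error <- function(df, response, k = 10, seed = 1000) {
  set.seed(seed)
  # Random fold labels, as in the loop above (fold sizes are only roughly equal)
  id <- sample(1:k, nrow(df), replace = TRUE)
  form <- as.formula(paste(response, "~ ."))
  err <- numeric(k)
  for (i in 1:k) {
    trainingset <- df[id != i, , drop = FALSE]
    testset     <- df[id == i, , drop = FALSE]
    fit  <- rpart(form, data = trainingset, method = "class")
    pred <- predict(fit, testset, type = "class")
    # Share of test rows whose predicted class differs from the true class
    err[i] <- mean(as.character(pred) != as.character(testset[[response]]))
  }
  err
}

errs <- cv_error(iris, "Species", k = 10)
mean(errs)  # average misclassification rate over the folds
```

Swapping in svm, kknn, boosting or bagging should only require changing the two lines that fit the model and produce the predicted classes, which keeps the folds identical across models for a fair comparison.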
1
votes

The answer given by kahlo is good, but it does not give equal-sized folds. Here is the method I use:

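k <- 10 # Number of folds
# Give every row a unique random rank between 1 and nrow(data)
data$class <- sample(1:nrow(data), nrow(data), replace = FALSE)
len.data <- length(data$class)
# Replace each rank by its fold label 1..k, using equal-width bins of ranks
for (i in 1:k) {
  data$class[data$class <= i * len.data / k & data$class > (i - 1) * len.data / k] <- i
}
for (i in 1:k) {
  train.set <- subset(data, class != i)
  test.set <- subset(data, class == i)

  ## Train using train.set (exclude the helper column `class` from the predictors)
  ## Test using test.set

}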
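An alternative way to get folds whose sizes differ by at most one is to shuffle a repeated 1..k sequence and keep the fold labels in a separate vector, so they cannot leak into a `cluster_ward ~ .` formula as a predictor. This is a sketch only; make_folds is a hypothetical helper name and iris again stands in for the real data.

```r
# Hypothetical helper: assign each of n rows to one of k folds,
# with fold sizes differing by at most one.
make_folds <- function(n, k = 10, seed = 1000) {
  set.seed(seed)
  sample(rep_len(1:k, n))  # 1, 2, ..., k repeated to length n, then shuffled
}

fold <- make_folds(nrow(iris), k = 10)
table(fold)  # number of rows in each fold

for (i in 1:10) {
  train.set <- iris[fold != i, ]
  test.set  <- iris[fold == i, ]
  ## fit on train.set, evaluate on test.set
}
```

Because the fold vector lives outside the data frame, the same assignment can be reused for every model being compared.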