I have a relatively large data set:
more than 370,000 observations, a categorical dependent variable with 250 levels, and 10 independent variables, both numeric and categorical.
I want to perform 10-fold cross-validation to compare several models (a classification tree from package 'rpart', an SVM from package 'e1071', kknn from package 'kknn', and boosting and bagging from package 'adabag').
After reading the manuals for these models, I wrote code to fit each of them, but I do not know how to perform the 10-fold CV itself.
I have tried, but I am new to R. I would appreciate help with code or functions for 10-fold CV.
Here is my code:
w <- read.csv('D:/R code/animal2.csv',header = T)
names(w)
[1] "cluster_ward" "AAT0" "ARIDITY" "TOPO" "TMAX"
[6] "PREMAX" "PREMIN" "AMT" "SU_CODE90" "T_OC"
[11] "ELEMAX"
nrow(w)
[1] 370827
w$TOPO <- as.factor(w$TOPO)
w$SU_CODE90 <- as.factor(w$SU_CODE90)
library(rpart.plot)
fit1 <- rpart(cluster_ward ~., w)
pred1 <- predict(fit1, w, type="class")
library(e1071)
fit2 <- svm(cluster_ward~., data=w, kernel="sigmoid")
pred2 <- predict(fit2, w)
library(kknn)
set.seed(1000)
fit3 <- kknn(cluster_ward~., train=w, test=w)
pred3 <- fit3$fit
library(adabag)
set.seed(1000)
fit4 <- boosting(cluster_ward~., w)
pred4 <- predict(fit4,w)$class
library(adabag)
set.seed(1000)
fit5 <- bagging(cluster_ward~., w)
pred5 <- predict(fit5,w)$class
Someone told me that the 'cvTools' or 'caret' package can perform k-fold CV, but I still have not managed to get it working with these packages or their functions.
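
To make the goal concrete, here is a minimal sketch of what I understand 10-fold CV to mean for a single model, written in base R rather than with 'cvTools' or 'caret'. It is only an illustration under assumptions: the built-in iris data stands in for w, Species stands in for cluster_ward, and the names dat, folds, and acc are made up for this example. I am not sure it is the right approach for my data or for the other four models.

```r
library(rpart)  # recommended package that ships with R

set.seed(1000)
dat <- iris   # assumption: stand-in for `w`; Species plays the role of cluster_ward
k <- 10

# Randomly assign every row to one of k folds of (nearly) equal size
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))

acc <- numeric(k)  # accuracy on each held-out fold
for (i in seq_len(k)) {
  train <- dat[folds != i, ]   # fit on the other k - 1 folds
  test  <- dat[folds == i, ]   # evaluate on the held-out fold
  fit   <- rpart(Species ~ ., data = train, method = "class")
  pred  <- predict(fit, test, type = "class")
  acc[i] <- mean(as.character(pred) == as.character(test$Species))
}

mean(acc)  # cross-validated accuracy, averaged over the k folds
```

If this is the right idea, I assume each of the other models would replace the rpart and predict lines inside the loop while reusing the same folds vector, so that all models are compared on identical splits.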