I am new to R and am using a for loop to implement 5-fold cross-validation with a C5.0 decision tree for an assignment. My dataset looks as follows:
head(data_known)
  order_item_id order_date item_id item_size brand_id item_price user_id
1             1    2012-09    1507   UNSIZED      102       24.9    4694
2             2    2012-11    1745        10       64       75.0    6097
3             3    2013-01    2588       XXL       42       79.9    7223
4             4    2012-08     164        40       47       79.9    4124
5             5    2012-09    1640         L       97       69.9     881
6             6    2013-03    2378        38       72      129.9    1576
  user_title user_dob             user_state user_reg_date
1        Mrs  1964-11   Rhineland-Palatinate       2011-02
2        Mrs  1973-08            Brandenburg       2011-05
3        Mrs  1949-08               Saarland       2013-01
4        Mrs  1960-12              Thuringia       2012-08
5        Mrs  1971-06     Baden-Wuerttemberg       2012-01
6        Mrs  1965-10 North Rhine-Westphalia       2011-02
  delivery_time_days user_title_NA item_size_NA user_dob_NA    target
1                  2             0            0           0    Return
2                  4             0            0           0 No Return
3                  2             0            0           0    Return
4                  5             0            0           0    Return
5                  3             0            0           0    Return
6                 11             0            0           0    Return
Now, my loop is:
library(plyr)  # provides ldply()
library(C50)   # provides C5.0()
# all columns except the target (column 16) are explanatory variables
explanatory_variables.dt<-names(data_known)[-16]
form.dt<-as.formula(paste("target ~", paste(explanatory_variables.dt, collapse = "+")))
# randomly assign the rows to 5 folds of roughly equal size
folds.dt<-split(data_known,cut(sample(1:nrow(data_known)),5))
errs.c50.dt<-rep(NA,length(folds.dt))
for (i in seq_along(folds.dt)) {
  test.dt<-ldply(folds.dt[i],data.frame)    # fold i is the test set
  train.dt<-ldply(folds.dt[-i],data.frame)  # the remaining folds form the training set
  tmp.model.dt<-C5.0(form.dt,train.dt)
  tmp.predict.dt<-predict(tmp.model.dt, newdata=test.dt)
  conf.mat.dt<-table(test.dt$target,tmp.predict.dt)
  errs.c50.dt[i]<-1-sum(diag(conf.mat.dt))/sum(conf.mat.dt)  # misclassification rate on fold i
}
print(sprintf("average error using k-fold cross validation and C5.0 decision tree algorithm: %.3f percent", 100*mean(errs.c50.dt)))
How do I access/save the whole tree model in the loop in order to predict the outcome of the target variable in another dataset where its true realizations are still unknown? Or do I have to base the predictions on tmp.model.dt alone when using cross-validation?
Thank you in advance for your help.
Best,
Nico
save for later use. – Roman Luštrik
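For illustration, here is a minimal sketch of one way to keep the models around, not necessarily what the comment had in mind. It assumes the C50 package is installed and uses the built-in iris data as a stand-in for data_known; the names toy, fold_id, models, final_model and model_file are made up for this example. Each fold's tree is stored in a list instead of being overwritten, and, as is common practice, a final tree is refit on all known rows (cross-validation then only serves to estimate the error) and written to disk with saveRDS() so it can be reloaded later to predict on the unlabelled data.

```r
library(C50)  # provides C5.0(); assumed to be installed

set.seed(42)   # reproducible fold assignment
toy <- iris    # stand-in for data_known (assumption for this sketch)
k <- 5
# random, balanced fold label for every row
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

models <- vector("list", k)  # one fitted tree per fold, kept instead of overwritten
errs <- numeric(k)

for (i in seq_len(k)) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  models[[i]] <- C5.0(Species ~ ., data = train)
  pred <- predict(models[[i]], newdata = test)
  # misclassification rate on the held-out fold
  errs[i] <- mean(as.character(pred) != as.character(test$Species))
}

# refit on all known rows; this is the model to use on new, unlabelled data
final_model <- C5.0(Species ~ ., data = toy)

# write the model to disk and read it back, e.g. in a later session
model_file <- file.path(tempdir(), "c50_final_model.rds")  # hypothetical path
saveRDS(final_model, model_file)
reloaded <- readRDS(model_file)

# head(toy) merely stands in for the dataset with unknown target values
new_pred <- predict(reloaded, newdata = head(toy))
```

With the real data, Species ~ . would be replaced by form.dt, and newdata would be the dataset whose target is still unknown. Any of models[[i]] could also be used for prediction, but each of those has seen only four fifths of the known rows.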