
I am new to R and am using a for loop to implement 5-fold cross-validation with a C5.0 decision tree for an assignment. My dataset looks as follows:

head(data_known)
order_item_id order_date item_id item_size brand_id item_price user_id
1             1    2012-09    1507   UNSIZED      102       24.9   4694
2             2    2012-11    1745        10       64       75.0   6097
3             3    2013-01    2588       XXL       42       79.9   7223
4             4    2012-08     164        40       47       79.9   4124
5             5    2012-09    1640         L       97       69.9    881
6             6    2013-03    2378        38       72      129.9   1576
user_title user_dob             user_state user_reg_date
1        Mrs  1964-11   Rhineland-Palatinate       2011-02
2        Mrs  1973-08            Brandenburg       2011-05
3        Mrs  1949-08               Saarland       2013-01
4        Mrs  1960-12              Thuringia       2012-08
5        Mrs  1971-06     Baden-Wuerttemberg       2012-01
6        Mrs  1965-10 North Rhine-Westphalia       2011-02   
delivery_time_days user_title_NA item_size_NA user_dob_NA    target
1                  2             0            0           0    Return
2                  4             0            0           0 No Return
3                  2             0            0           0    Return
4                  5             0            0           0    Return
5                  3             0            0           0    Return
6                 11             0            0           0    Return

Now, my loop is:

library(C50)   # provides C5.0()
library(plyr)  # provides ldply()

# All columns except the 16th (target) serve as predictors
explanatory_variables.dt <- names(data_known)[-16]
form.dt <- as.formula(paste("target ~", paste(explanatory_variables.dt, collapse = "+")))
# Randomly assign the rows to 5 folds
folds.dt <- split(data_known, cut(sample(1:nrow(data_known)), 5))
errs.c50.dt <- rep(NA, length(folds.dt))

for (i in 1:length(folds.dt)) {
  test.dt <- ldply(folds.dt[i], data.frame)    # fold i is the test set
  train.dt <- ldply(folds.dt[-i], data.frame)  # remaining folds form the training set
  tmp.model.dt <- C5.0(form.dt, train.dt)
  tmp.predict.dt <- predict(tmp.model.dt, newdata = test.dt)
  conf.mat.dt <- table(test.dt$target, tmp.predict.dt)
  # Misclassification rate: 1 - (correct predictions / all predictions)
  errs.c50.dt[i] <- 1 - sum(diag(conf.mat.dt)) / sum(conf.mat.dt)
}
print(sprintf("average error using k-fold cross validation and C5.0 decision tree algorithm: %.3f percent", 100 * mean(errs.c50.dt)))

How do I access/save the whole tree model inside the loop so that I can predict the target variable in another dataset where its true values are still unknown? Or do I have to base the predictions on tmp.model.dt alone when using cross-validation?

Thank you in advance for your help.

Best,

Nico

The structure you're after is a list. Create one and store the model there. You can save the list using save for later use. - Roman Luštrik
Thank you for the quick reply, Roman. I have since solved it thanks to the comments from you and j. - Nico

1 Answer


Here is a simple reproducible answer that expands upon Roman's comment.

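list_models <- list()  # container for the fitted models
for (i in 1:2){
   tmp_data <- mtcars[,c(1, i+1)]  # mpg plus one predictor column
   list_models[[i]] <- lm(mpg ~ ., data = tmp_data)  # store model i in the list
}
# Each stored model remains available for prediction after the loop
head(predict(list_models[[1]], newdata = mtcars))
head(predict(list_models[[2]], newdata = mtcars))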

I am using lm here, but the same approach works with C5.0, because predict is a generic function that dispatches on the class of the model object.
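
To connect this back to the question, here is a minimal sketch of the same list idea inside a 5-fold loop with C5.0. It assumes the C50 package is installed and uses the built-in iris data as a stand-in for data_known. The names data_new, fold_id, models and model_file are illustrative and not taken from the question, and the fold assignment is done with a simple index vector rather than split/ldply.

```r
# Sketch only: assumes the C50 package is installed; iris stands in for data_known.
library(C50)

set.seed(1)                                    # reproducible sampling
holdout    <- sample(nrow(iris), 30)
data_new   <- iris[holdout, names(iris) != "Species"]  # hypothetical data with unknown target
data_known <- iris[-holdout, ]                 # hypothetical labelled data

k       <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(data_known)))  # random fold labels

models <- vector("list", k)                    # one slot per fold model
errs   <- numeric(k)

for (i in seq_len(k)) {
  train <- data_known[fold_id != i, ]
  test  <- data_known[fold_id == i, ]
  models[[i]] <- C5.0(Species ~ ., data = train)        # keep the model, not just the error
  pred    <- predict(models[[i]], newdata = test)
  errs[i] <- mean(as.character(pred) != as.character(test$Species))
}

mean(errs)                                     # cross-validated error estimate

# Persist the whole list and restore it later (temporary file used here)
model_file <- file.path(tempdir(), "c50_fold_models.rds")
saveRDS(models, model_file)
models_loaded <- readRDS(model_file)

# Any stored model can predict on data whose target is unknown
pred_new <- predict(models_loaded[[1]], newdata = data_new)
head(pred_new)
```

Each fold model was trained on only four fifths of the known data. A common alternative for the final predictions is to use the cross-validation error purely as a performance estimate and refit one model on all of the known data.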