I am trying to build a random forest model for a price prediction problem. I have gone through the following steps:

1) Split the data into three sets: train, valid and test (three sets are required, not just train and test)

set.seed(1234)
assignment <- sample(1:3, size = nrow(train), prob = c(0.7, 0.15, 0.15), replace = TRUE)
# Create train, validation and test sets from the original train data
train_train <- train[assignment == 1, ]
train_valid <- train[assignment == 2, ]
train_test  <- train[assignment == 3, ]

2) Built the model with x and y taken from the train set

# trControl is defined elsewhere (not shown in the question)
fit_rf_train <- train(
  x = train_train[, -which(names(train_train) %in%
                             c("Item_Identifier", "Item_Outlet_Sales"))],
  y = train_train$Item_Outlet_Sales,
  method = "ranger",
  metric = "RMSE",
  tuneGrid = expand.grid(
    .mtry = 6,
    .splitrule = "variance",
    .min.node.size = c(10, 15, 20)),
  trControl = trControl,
  importance = "permutation",
  num.trees = 350)

The model output on the train data was attached as a screenshot:

[Screenshot: model output on train data]

3) Applied the model to the two other data sets, valid and test, with predict:

prediction_test <- predict(fit_rf_train, train_test)
prediction_valid <- predict(fit_rf_train, train_valid)

My question is: how can I measure the performance of the model on the unseen data (test and valid)?

Which package are you using? How can you measure the performance? I think by using RMSE, since you chose that as your metric. The lower, the better. Use caret::RMSE, assuming you're using caret. - NelsonGon
The caret package. Yes, I know I will be using RMSE. What I mean is: which function reports the performance of the predictions? With print(fit_rf_train) I can see the RMSE value on the train set; how can I do the same for the output of predict? - user233531
How would you do it for a classification? You would use confusionMatrix. Here you use RMSE() instead. Type ?RMSE and you'll see several options. - NelsonGon
If I type RMSE(prediction_test) I get this error: argument "obs" is missing, with no default. Should I use the train data as the value of the obs argument? - user233531
I have no access to train. Please add a dput of train. - NelsonGon

1 Answer

If you want to stick with caret, then you can do the following:

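library(caret)
# Hold out 20% of iris as unseen data. Note: this partition is drawn before
# set.seed() is called, so the exact RMSE below will vary between runs.
trainda <- createDataPartition(iris$Sepal.Length, p = 0.8, list = FALSE)
valid_da <- iris[-trainda, ]
trainda <- iris[trainda, ]
ctrl <- trainControl(method = "cv", number = 5)
set.seed(233)
m <- train(Sepal.Length ~ ., data = trainda, method = "rf", metric = "RMSE",
           trControl = ctrl, verbose = FALSE)
# Predict on the held-out rows, then compare predictions with the observed values
m1 <- predict(m, valid_da)
RMSE(m1, valid_da$Sepal.Length)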

Result:

[1] 0.3499783
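
The same idea carries over to a three-way split like the one in the question. The following is only a sketch under assumptions: iris stands in for the asker's data, method = "lm" replaces the ranger forest to keep the example quick, and the iris_* and perf_* names are made up for illustration. It uses caret's postResample(), which for a numeric outcome reports RMSE, Rsquared and MAE in one call, and it checks the RMSE by hand to show what the obs argument is for: the observed outcome of the same set that was predicted, not the train data.

```r
library(caret)

# Three-way split in the same style as the question (iris is a stand-in dataset)
set.seed(1234)
assignment <- sample(1:3, size = nrow(iris), prob = c(0.7, 0.15, 0.15), replace = TRUE)
iris_train <- iris[assignment == 1, ]
iris_valid <- iris[assignment == 2, ]
iris_test  <- iris[assignment == 3, ]

# Fit on the train part only (a linear model here, purely to keep the sketch fast)
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(Sepal.Length ~ ., data = iris_train, method = "lm", trControl = ctrl)

# Predict on the two unseen sets
pred_valid <- predict(fit, iris_valid)
pred_test  <- predict(fit, iris_test)

# obs is the observed outcome of the set that was predicted
perf_valid <- postResample(pred = pred_valid, obs = iris_valid$Sepal.Length)
perf_test  <- postResample(pred = pred_test,  obs = iris_test$Sepal.Length)
print(rbind(valid = perf_valid, test = perf_test))

# RMSE computed by hand, to compare with the RMSE entry reported above
rmse_by_hand <- sqrt(mean((pred_test - iris_test$Sepal.Length)^2))
```

With the objects from the question, the corresponding calls would be RMSE(prediction_test, train_test$Item_Outlet_Sales) and RMSE(prediction_valid, train_valid$Item_Outlet_Sales).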