I am trying to build a random forest model for a price prediction problem. I have gone through the following steps:
1) Split the data into 3 sets: train, validation and test (it is required to split into 3 sets, not only train and test).
set.seed(1234)
assignment <- sample(1:3, size = nrow(train), prob = c(0.7, 0.15, 0.15), replace = TRUE)
# Create train, validation and test sets from the train data
train_train <- train[assignment == 1, ]
train_valid <- train[assignment == 2, ]
train_test <- train[assignment == 3, ]
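As a quick sanity check (not part of the original steps, just a suggestion), the sizes of the three splits can be compared against the intended 70/15/15 proportions:

# Share of rows assigned to each split; should be roughly 0.70 / 0.15 / 0.15
prop.table(table(assignment))
nrow(train_train); nrow(train_valid); nrow(train_test)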
2) I built the model with x and y taken from the training split:
fit_rf_train <- train(x = train_train[, -which(names(train_train) %in%
                                                 c("Item_Identifier", "Item_Outlet_Sales"))],
                      y = train_train$Item_Outlet_Sales,
                      method = "ranger",
                      metric = "RMSE",
                      tuneGrid = expand.grid(.mtry = 6,
                                             .splitrule = "variance",
                                             .min.node.size = c(10, 15, 20)),
                      trControl = trControl,
                      importance = "permutation",
                      num.trees = 350)
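Note that the trControl object referenced above is not defined in the snippet. A minimal sketch of what it might look like, assuming plain k-fold cross-validation (the resampling method and number of folds are assumptions, not taken from the original post):

library(caret)
# Assumed resampling setup for the trControl argument used in train() above
trControl <- trainControl(method = "cv",      # k-fold cross-validation (assumption)
                          number = 5,         # 5 folds (assumption)
                          verboseIter = FALSE)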
I have the following screenshot of the model output on the same training data:
3) Using the predict function, I applied the model to the two other data sets, validation and test, with the following lines of code:
prediction_test <- predict(fit_rf_train, train_test)
prediction_valid <- predict(fit_rf_train, train_valid)
My question is: how can I measure the performance of the model on the unseen data (validation and test)?
Comments:
– NelsonGon: caret::RMSE, assuming you're using caret.
– NelsonGon: confusionMatrix; now instead, you're using RMSE(). Type ?RMSE and you'll see several options.
– NelsonGon: Please add a dput of train.
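Following the caret::RMSE suggestion above, a minimal sketch of how the hold-out performance could be measured, assuming Item_Outlet_Sales is the true target column in train_valid and train_test:

library(caret)
# RMSE of the predictions against the true values on each hold-out split
RMSE(prediction_valid, train_valid$Item_Outlet_Sales)
RMSE(prediction_test,  train_test$Item_Outlet_Sales)

# postResample() reports RMSE, R-squared and MAE in one call
postResample(pred = prediction_valid, obs = train_valid$Item_Outlet_Sales)
postResample(pred = prediction_test,  obs = train_test$Item_Outlet_Sales)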