R - Predicting on new data in workflow

Question

I created a training and test set and tested my model. My workflow looks as follows:

# Test/train
set.seed(2402) ## Set the seed so the random split is reproducible
splits <- initial_split(Data, prop = 0.7) ## 70% will be training data

# Create a train and test set
Data_train <- training(splits)
Data_test <- testing(splits)

# Specify the model
rf_mod <- rand_forest(mtry = tune(), min_n = tune(), trees = 200) %>%
  set_mode("regression") %>%
  set_engine("ranger", importance = "permutation")

# Create a workflow
rf_mod_workflow <-  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_mod_recipe) 
rf_mod_workflow

# State our error metrics
class_metrics <- metric_set(rmse, mae)

# Make the computation faster with registerDoParallel()

registerDoParallel()

rf_grid <- grid_regular(
  mtry(range = c(5, 15)),
  min_n(range = c(10, 200)),
  levels = 5
)

rf_grid

set.seed(654321)

rf_tune_res <- tune_grid(
  rf_mod_workflow,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = class_metrics
)

# Select the best mtry and min_n by RMSE
best_rmse <- select_best(rf_tune_res, metric = "rmse")
rf_final_wf <- finalize_workflow(rf_mod_workflow, best_rmse)
rf_final_wf

# Fit the final workflow on the training data and evaluate it on the test data
set.seed(56789)
rf_final_fit <- rf_final_wf %>%
  last_fit(splits, metrics = class_metrics)

However, I now want to use my created model to predict on a new dataset. The problem is that this new dataset contains NA values. Is it still possible to predict on a dataset that has NA values, or does the random forest not allow it? I did something similar for a linear regression and that one ignored the NA values and only predicted for instances where no NA values are present.

Jonas Jonas · Accepted Answer · 2021-05-19T09:36:22

If ignoring NAs is fine for you, just remove them from your new data. Assuming your new data is a data frame called newdata, you can get the indices of rows with at least one NA value as follows:

naRowIndices <- rowSums(is.na(newdata)) >= 1
newNonNaData <- newdata[!naRowIndices ,]

Then run the prediction, newNonNaPredictions <- ...predict newNonNaData.... Now you have your predictions on the data without NAs. If you need the prediction output to have the same length as the original data with NAs, you need padding like:

newPrediction <- rep(NA,NROW(newdata))
newPrediction[!naRowIndices] <- newNonNaPredictions
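
The two steps above can be combined into one small helper. The following is only a sketch in base R: predict_with_na_padding is a hypothetical name, and an lm fit on the built-in mtcars data stands in for the random forest so that the example runs without extra packages.

```r
# Hypothetical helper: predict only on complete rows, then pad the result
# back to the length of the original data with NA.
# `predict_fun` is any function that takes a data frame and returns a
# numeric vector with one prediction per row.
predict_with_na_padding <- function(newdata, predict_fun) {
  na_rows <- rowSums(is.na(newdata)) >= 1   # TRUE for rows with at least one NA
  out <- rep(NA_real_, nrow(newdata))       # one slot per original row
  if (any(!na_rows)) {
    out[!na_rows] <- predict_fun(newdata[!na_rows, , drop = FALSE])
  }
  out
}

# Stand-in model (assumption: lm on mtcars replaces the fitted random forest)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Toy new data with missing values in rows 2 and 4
newdata <- mtcars[1:5, c("wt", "hp")]
newdata$wt[2] <- NA
newdata$hp[4] <- NA

preds <- predict_with_na_padding(
  newdata,
  function(d) unname(predict(fit, newdata = d))
)
preds  # length 5, with NA in positions 2 and 4
```

For the workflow in the question, the prediction function could probably be something like function(d) predict(extract_workflow(rf_final_fit), new_data = d)$.pred, assuming your installed version of tune provides extract_workflow() for last_fit() results. Note that rowSums(is.na(newdata)) also drops rows whose only NA is in a column the model does not use, so you may want to restrict newdata to the predictor columns first.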
