R - Random Forest Prediction on new data

Question

I'm working with Random Forests/Logistic Regression models for my predictions. Part of my study is to create a 'new' data frame to resemble a new patient, and predict the likelihood of them experiencing death within 30 days of a procedure. After performing double-cross validation to get an accuracy rating I'm currently fitting my data on the full dataset:

#Logistic Regression Model:
fullModelMort = glm(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,family="binomial")

#Random Forest Model:
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05))

I then am creating 'New Patients' to feed into my models for prediction and predicting the probability of mortality based on these inputs:

#New Patients for Predictions
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")

newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")

#Predict using LR Model
Patient1 = predict(fullModelMort, newPatient1, type="response")
Patient2 = predict(fullModelMort, newPatient2, type="response")

#Classify whether a patient is High/Low Risk based on probability for mortality:
Determine_Mortality = function(prediction){
  if(prediction > .05){
    Response=paste("High Risk:", round(prediction*100,2) ,"% Chance of Mortality")
    return(Response)
  }
  else{
    Response=paste("Low Risk:", round(prediction*100,2) ,"% Chance of Mortality")
    return(Response)
  }
}

print(paste0("Patient 1 Results - ", Determine_Mortality(Patient1)))
print(paste0("Patient 2 Results - ", Determine_Mortality(Patient2)))

This section works fine for the logistic regression model, however, when I try to do the same thing for my Random Forest Models, I'm getting the following error:

Error in predict.randomForest(surgery.bag, newPatient1, type = "response") : Type of predictors in new data do not match that of the training data.

This is my code for the Random Forest Predictions:

newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")

newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")

test1=predict(surgery.bag,newPatient1,type="response")

Summary of the dataset used to fit the models (Although only a subset of these columns are used in the fitting)

                               ahrq_ccs          age        gender                  race       asa_status         bmi        baseline_cancer baseline_cvd baseline_dementia
 Arthroplasty knee                  : 3032   Min.   : 1.00   F:15279   African American: 3416   I-II :15244   Min.   : 2.15   No :18593       No :13947    No :28087        
 Nephrectomy; partial or complete   : 2559   1st Qu.:48.30   M:13008   Caucasian       :23768   III  :12142   1st Qu.:24.61   Yes: 9694       Yes:14340    Yes:  200        
 Spinal fusion                      : 2377   Median :58.60             Other           : 1103   IV-VI:  901   Median :28.20                                                 
 Open prostatectomy                 : 2356   Mean   :57.71                                                    Mean   :29.47                                                 
 Colorectal resection               : 2269   3rd Qu.:68.30                                                    3rd Qu.:32.84                                                 
 Hysterectomy; abdominal and vaginal: 2253   Max.   :90.00                                                    Max.   :92.59                                                 
 (Other)                            :13441                                                                                                                                  
 baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary baseline_charlson mortality_rsi     complication_rsi  ccsMort30Rate      ccsComplicationRate
 No :24582         No :22021          No :23195         No :25639      No :25202          Min.   : 0.000    Min.   :-4.4000   Min.   :-4.7200   Min.   :0.000000   Min.   :0.01612    
 Yes: 3705         Yes: 6266          Yes: 5092         Yes: 2648      Yes: 3085          1st Qu.: 0.000    1st Qu.:-1.2400   1st Qu.:-0.8600   1st Qu.:0.000789   1st Qu.:0.08198    
                                                                                          Median : 0.000    Median :-0.3000   Median :-0.3100   Median :0.002764   Median :0.10937    
                                                                                          Mean   : 1.178    Mean   :-0.5385   Mean   :-0.4258   Mean   :0.004328   Mean   :0.13322    
                                                                                          3rd Qu.: 2.000    3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:0.007398   3rd Qu.:0.18337    
                                                                                          Max.   :13.000    Max.   : 4.8300   Max.   :12.5600   Max.   :0.016673   Max.   :0.46613    
                                                                                                                                                                                      
      hour         dow           month               moonphase    mort30    complication
 Min.   : 6.000   Fri:5351   Jun    : 2845   First Quarter:7126   0:28170   0:24542     
 1st Qu.: 7.000   Mon:6223   Aug    : 2734   Full Moon    :7175   1:  117   1: 3745     
 Median : 9.000   Thu:4936   Mar    : 2587   Last Quarter :7159                         
 Mean   : 9.854   Tue:6258   Apr    : 2547   New Moon     :6827                         
 3rd Qu.:12.000   Wed:5519   Jan    : 2534                                              
 Max.   :18.000              May    : 2524                                              
                             (Other):12516

I'm curious if it's because the levels of my test data don't match the levels of the dataset used for fitting, but I can't be certain.

The dataset can be downloaded here Reproducible Code:

library(MASS)
library(ggplot2)
library(Hmisc)
library(corrplot)
library(dplyr)
library(randomForest)
library(tidyr)

#Read in the data set
surgery=read.csv("SurgeryTiming.csv")

#Remove dummy values
surgery$gender[surgery$gender == ""] <- NA
surgery$asa_status[surgery$asa_status == ""] <- NA
surgery$race[surgery$race == ""] <- NA
surgery$bmi[surgery$bmi == ""] <- NA

surgery$hour = as.numeric(sub("\\..*", "", as.character(surgery$hour))) #Split out the base hour of surgery

#Drop NA values
surgery = surgery %>% drop_na(gender)
surgery = surgery %>% drop_na(asa_status)
surgery = surgery %>% drop_na(race)
surgery = surgery %>% drop_na(age)
surgery = surgery %>% drop_na(bmi)

#Drop additional levels that now have no values
surgery$gender = droplevels(surgery$gender)
surgery$asa_status = droplevels(surgery$asa_status)
surgery$race = droplevels(surgery$race)

#View our numeric data distributions
num_data <- surgery[,sapply(surgery,is.numeric)]
hist.data.frame(num_data)

surgery$complication=revalue(surgery$complication,c("Yes"=1)) 
surgery$complication=revalue(surgery$complication,c("No"=0))
surgery$mort30=revalue(surgery$mort30,c("Yes"=1))
surgery$mort30=revalue(surgery$mort30,c("No"=0))

newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")


newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")



surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05)) #The cutoff is the probability for each group selection, probs of 1% or higher are classified as 'Mortality' occuring

test1=predict(surgery.bag,newPatient1,type="response")

Any suggestions/advice is extremely appreciated.

Does this answer your question? Type Mismatch Error using randomForest in R — s__
All names, variable types and levels must be the same between the data going into he model and the "newdata" specified in predict. Look at the data.frames using str to verify this information. — Jeffrey Evans
@JeffreyEvans If my predictor variables are a subset of the whole dataset, do I only need to create levels for those variables? I just updated my code to include all different levels for the variables I am using in the prediction, but still getting the same error. — peakstatus
You may have a level(s) in the subset that is not in the data passed to randomForest. This would most certainly throw this error. It should not throw an error the other direction (level in the model that is not in the newdata) but, you never know. — Jeffrey Evans

bstrain bstrain · Accepted Answer · 2020-06-30T14:39:31

I can't offer a solution but the error that you're getting has led me to try:

test1=predict(surgery.bag,newPatient1,type="response")
test2 <- predict(surgery.bag,surgery[7,],type = "response")

The last line will run, making a prediction on row #7 of the original dataframe. This tells me that there is an issue in the way that you are creating newPatient1 and newPatient2. The data in those rows has the correct type (factor, num, etc...) but when you create (for example) newPatient1$baseline_cancer it is a factor with only 1 level, whereas surgery$baseline_cancer is a factor with 2 levels. You can see the differences by calling:

str(surgery[,names(newPatient1)])
str(newPatient1)

I am not sure how to create newPatient1 with the correct level of factors for each variable.

R - Random Forest Prediction on new data

1 Answers