I'm working with Random Forests/Logistic Regression models for my predictions. Part of my study is to create a 'new' data frame to resemble a new patient, and predict the likelihood of them experiencing death within 30 days of a procedure. After performing double-cross validation to get an accuracy rating I'm currently fitting my data on the full dataset:
#Logistic Regression Model:
fullModelMort = glm(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,family="binomial")
#Random Forest Model:
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05))
I then am creating 'New Patients' to feed into my models for prediction and predicting the probability of mortality based on these inputs:
#New Patients for Predictions
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
#Predict using LR Model
Patient1 = predict(fullModelMort, newPatient1, type="response")
Patient2 = predict(fullModelMort, newPatient2, type="response")
#Classify whether a patient is High/Low Risk based on probability for mortality:
Determine_Mortality = function(prediction){
if(prediction > .05){
Response=paste("High Risk:", round(prediction*100,2) ,"% Chance of Mortality")
return(Response)
}
else{
Response=paste("Low Risk:", round(prediction*100,2) ,"% Chance of Mortality")
return(Response)
}
}
print(paste0("Patient 1 Results - ", Determine_Mortality(Patient1)))
print(paste0("Patient 2 Results - ", Determine_Mortality(Patient2)))
This section works fine for the logistic regression model, however, when I try to do the same thing for my Random Forest Models, I'm getting the following error:
Error in predict.randomForest(surgery.bag, newPatient1, type = "response") : Type of predictors in new data do not match that of the training data.
This is my code for the Random Forest Predictions:
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
test1=predict(surgery.bag,newPatient1,type="response")
Summary of the dataset used to fit the models (Although only a subset of these columns are used in the fitting)
ahrq_ccs age gender race asa_status bmi baseline_cancer baseline_cvd baseline_dementia
Arthroplasty knee : 3032 Min. : 1.00 F:15279 African American: 3416 I-II :15244 Min. : 2.15 No :18593 No :13947 No :28087
Nephrectomy; partial or complete : 2559 1st Qu.:48.30 M:13008 Caucasian :23768 III :12142 1st Qu.:24.61 Yes: 9694 Yes:14340 Yes: 200
Spinal fusion : 2377 Median :58.60 Other : 1103 IV-VI: 901 Median :28.20
Open prostatectomy : 2356 Mean :57.71 Mean :29.47
Colorectal resection : 2269 3rd Qu.:68.30 3rd Qu.:32.84
Hysterectomy; abdominal and vaginal: 2253 Max. :90.00 Max. :92.59
(Other) :13441
baseline_diabetes baseline_digestive baseline_osteoart baseline_psych baseline_pulmonary baseline_charlson mortality_rsi complication_rsi ccsMort30Rate ccsComplicationRate
No :24582 No :22021 No :23195 No :25639 No :25202 Min. : 0.000 Min. :-4.4000 Min. :-4.7200 Min. :0.000000 Min. :0.01612
Yes: 3705 Yes: 6266 Yes: 5092 Yes: 2648 Yes: 3085 1st Qu.: 0.000 1st Qu.:-1.2400 1st Qu.:-0.8600 1st Qu.:0.000789 1st Qu.:0.08198
Median : 0.000 Median :-0.3000 Median :-0.3100 Median :0.002764 Median :0.10937
Mean : 1.178 Mean :-0.5385 Mean :-0.4258 Mean :0.004328 Mean :0.13322
3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.007398 3rd Qu.:0.18337
Max. :13.000 Max. : 4.8300 Max. :12.5600 Max. :0.016673 Max. :0.46613
hour dow month moonphase mort30 complication
Min. : 6.000 Fri:5351 Jun : 2845 First Quarter:7126 0:28170 0:24542
1st Qu.: 7.000 Mon:6223 Aug : 2734 Full Moon :7175 1: 117 1: 3745
Median : 9.000 Thu:4936 Mar : 2587 Last Quarter :7159
Mean : 9.854 Tue:6258 Apr : 2547 New Moon :6827
3rd Qu.:12.000 Wed:5519 Jan : 2534
Max. :18.000 May : 2524
(Other):12516
I'm curious if it's because the levels of my test data don't match the levels of the dataset used for fitting, but I can't be certain.
The dataset can be downloaded here Reproducible Code:
library(MASS)
library(ggplot2)
library(Hmisc)
library(corrplot)
library(dplyr)
library(randomForest)
library(tidyr)
#Read in the data set
surgery=read.csv("SurgeryTiming.csv")
#Remove dummy values
surgery$gender[surgery$gender == ""] <- NA
surgery$asa_status[surgery$asa_status == ""] <- NA
surgery$race[surgery$race == ""] <- NA
surgery$bmi[surgery$bmi == ""] <- NA
surgery$hour = as.numeric(sub("\\..*", "", as.character(surgery$hour))) #Split out the base hour of surgery
#Drop NA values
surgery = surgery %>% drop_na(gender)
surgery = surgery %>% drop_na(asa_status)
surgery = surgery %>% drop_na(race)
surgery = surgery %>% drop_na(age)
surgery = surgery %>% drop_na(bmi)
#Drop additional levels that now have no values
surgery$gender = droplevels(surgery$gender)
surgery$asa_status = droplevels(surgery$asa_status)
surgery$race = droplevels(surgery$race)
#View our numeric data distributions
num_data <- surgery[,sapply(surgery,is.numeric)]
hist.data.frame(num_data)
surgery$complication=revalue(surgery$complication,c("Yes"=1))
surgery$complication=revalue(surgery$complication,c("No"=0))
surgery$mort30=revalue(surgery$mort30,c("Yes"=1))
surgery$mort30=revalue(surgery$mort30,c("No"=0))
newPatient1=data.frame(ahrq_ccs="Colorectal resection",age=70,asa_status="IV-VI",bmi=27.9,baseline_cancer="Yes",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="No",baseline_psych="No",baseline_pulmonary="No")
newPatient2=data.frame(ahrq_ccs="Gastrectomy; partial and total",age=34,asa_status="III",bmi=22.9,baseline_cancer="No",baseline_cvd="Yes",baseline_dementia="No",baseline_diabetes="No",baseline_digestive="No",baseline_osteoart="Yes",baseline_psych="No",baseline_pulmonary="No")
surgery.bag = randomForest(mort30~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.95,0.05)) #The cutoff is the probability for each group selection, probs of 1% or higher are classified as 'Mortality' occuring
test1=predict(surgery.bag,newPatient1,type="response")
Any suggestions/advice is extremely appreciated.
str
to verify this information. - Jeffrey Evans