I want to train a random forest to make a categorical prediction. I want to include a fixed set of independent variables in the prediction model (e.g. x1, x2, and x3 in Y~.+x1+x2+x3), but exclude them from the set of independent variables (represented by . in the example) that can be used to partition the data and create branches in the trees of the forest. Is there a simple way to do this using caret, grf, or another package in R?
Here's an example. Suppose I want to predict which flowers in the iris dataset have a sepal width over 3.2. I want to condition on flower species when deciding whether to create a new branch, while excluding species itself as a possible variable to split on. In other words, I know that flower species is a good predictor of sepal width, but I want to know what other factors predict sepal width, conditional on species:
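library(caret)  # provides train(), trainControl(), and varImp()
data(iris)
d <- iris
# Binary outcome: is the sepal width over 3.2?
d$sepal_width_over3point2 <- as.factor(d$Sepal.Width > 3.2)
# Dummy-code species as three indicator variables
d$Type1 <- as.numeric(d$Species == 'versicolor')
d$Type2 <- as.numeric(d$Species == 'virginica')
d$Type3 <- as.numeric(d$Species == 'setosa')
# Drop the original species column and the variable the outcome is derived from
d <- subset(d, select = -c(Species, Sepal.Width))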
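## Set parameters to train models
# Run algorithms using 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
# Random Forest
# As written, this formula yields the same model as sepal_width_over3point2 ~ .,
# so Type1, Type2, and Type3 are still candidates for splitting
set.seed(11)
rf <- train(sepal_width_over3point2 ~ . + Type1 + Type2 + Type3, data = d,
            method = "rf", metric = metric, trControl = control)
print(rf)
example_varImp_rf <- varImp(rf)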
When I look at the variable importance in this model, I'd like the estimates for the other predictors (Sepal.Length, Petal.Length, and Petal.Width) to be conditional on Type1, Type2, and Type3, without those three species indicators being available as variables to branch on. Is there a way to tell the random forest to ignore these three variables as possible splits while still conditioning on them?
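To make the goal concrete, here is a minimal sketch of one possible workaround. It is not a built-in option of caret or grf, and it may not be equivalent to what a true "condition but do not split" forest would do. It assumes the randomForest package (the backend for caret's method="rf") and uses a two-stage idea: first subtract the species-wise share of flowers over 3.2 from the outcome, then fit a regression forest on that residual using only the other predictors. The names y_resid and rf_resid are made up for this illustration.

```r
library(randomForest)

data(iris)

# Outcome from the question, kept numeric (0/1) so it can be residualized
y <- as.numeric(iris$Sepal.Width > 3.2)

# Stage 1 (assumption of this sketch): remove the part of the outcome that
# species alone explains, i.e. subtract the within-species share over 3.2
y_resid <- y - ave(y, iris$Species)

# Stage 2: regression forest on the residual. Species is not among the
# predictors, so it can never be selected as a split variable.
x <- iris[, c("Sepal.Length", "Petal.Length", "Petal.Width")]
set.seed(11)
rf_resid <- randomForest(x = x, y = y_resid, importance = TRUE)

# Importance of the remaining predictors for the species-adjusted outcome
importance(rf_resid)
```

Note that this only adjusts the outcome for species. The predictors themselves are left untouched, so importance can still reflect differences between species that show up in petal and sepal measurements.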