0
votes

I want to train a random forest to make a categorical prediction. If I want to include a fixed set of independent variables in the prediction model (e.g. x1, x2, and x3 in Y~.+x1+x2+x3), but exclude them from the set of independent variables (represented by . in the example) that can be used to partition the data/create branches/trees in the forest, is there a simple way to do this using caret, grf, or another package in R?

Here's an example: suppose I want to predict which flowers in the iris dataset have sepal width over 3.2, conditioning on flower species when deciding whether to create a new branch, while excluding flower species as a possible variable to split on. Imagine that I know flower species is a good predictor of sepal width, but I want to know what other factors predict sepal width, conditional on species:

data(iris)
d <- iris

d$sepal_width_over3point2<-as.factor(d$Sepal.Width>3.2)
d$Type1<-as.numeric(d$Species=='versicolor')
d$Type2<-as.numeric(d$Species=='virginica')
d$Type3<-as.numeric(d$Species=='setosa')

d<-subset(d,select=-c(Species,Sepal.Width))


## Set parameters to train models
library(caret)

# Run algorithms using 10-fold cross-validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"

# Random Forest
set.seed(11)
rf <- train(sepal_width_over3point2~.+Type1+Type2+Type3, data=d, method="rf", metric=metric, trControl=control)
print(rf)

example_varImp_rf<-varImp(rf)

When I look at the variable importance in this model, I'd like to know that the estimates for the other predictors (Sepal.Length, Petal.Length, and Petal.Width) are conditional on flower Type1, Type2, and Type3, while excluding those three variables as possible variables to branch on. Is there a way to tell the random forest to ignore these three variables as possible splits?

1
It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. – MrFlick
What's the difference between "include in prediction model" and "include them for creating branches in the forest"? – Ben Reiniger
Thanks @MrFlick -- I've added an example that I hope clarifies my question -- Sorry for any confusion; I'm still learning how these tools work. – Blake Heller
@BenReiniger I'm trying to use a random forest to predict a binary characteristic Y that branches on a subset of variables (x_1....x_n), but considers the predictive power of each of these x variables conditional on another subset of z variables (say z1, z2, z3). I want to know which x's are important predictors, but conditional on the z's (and I don't want the forest to split on the z's). I tried to add an example to my question that clarifies what I'm trying to accomplish. Thanks for any insight or advice you can offer! – Blake Heller

1 Answer

0
votes

That would require each node split to learn a separate threshold for each flower species, which is more computationally expensive than what most tree learners support. I don't know of any R package that implements this.

One possible workaround is feature engineering. In this case, where the variable you condition on is a small categorical, you could standardize each numeric feature within its flower species, so that a split becomes something like "sepal length is at least 20% higher than the species average" or "sepal length is at least one (species) standard deviation above the species average."
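As a minimal sketch of that workaround (using base R's `ave()` to compute group-wise statistics on the original iris columns; the `_std` suffix is just an illustrative naming choice):

```r
# Standardize each numeric feature within its species, so the forest can
# split on "how far above/below the species average" instead of on species.
data(iris)
d <- iris
num_cols <- c("Sepal.Length", "Petal.Length", "Petal.Width")
for (col in num_cols) {
  grp_mean <- ave(d[[col]], d$Species, FUN = mean)  # per-row species mean
  grp_sd   <- ave(d[[col]], d$Species, FUN = sd)    # per-row species sd
  d[[paste0(col, "_std")]] <- (d[[col]] - grp_mean) / grp_sd
}
# A split on Sepal.Length_std now reads as "sepal length is at least k
# species standard deviations above the species mean", so Species itself
# never needs to appear as a split variable.
```

You would then drop `Species` (and the raw columns, if desired) before training, and pass only the `_std` features to `train()`.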