1
votes

I'm using random forest to estimate the importance (%IncMSE) of a number of predictors. Afterwards, I use a combination of all predictors but one and calculate their importance again. Random forest works in theory even with only two predictors, but I'm wondering if it makes sense when the mtry parameter is set to 1. If I'm correct, this would mean that at each split random forest fits the tree with only one of the two predictors. If I had to build a model with only two predictors, random forest would not have been my first choice, but I don't know if it is sensible to use two different statistical methods for a similar task in the same paper. Do you think I should use random forest with two predictors too, or would another model be more appropriate? Thanks

2
What do you mean that it works with only two predictors? It has no such inherent limitation, and the R implementation you seem to refer to doesn't have this limitation either. (Sean Owen)
Yes, I know it works. My question wasn't about feasibility but about the reliability of the output. As far as I know, random forest is generally used with many predictor variables; I've never seen it used with only two. Also, since it works by sampling a subset of variables at each split, I was wondering whether the result is still valid when it can only sample one of the two variables each time. (Oritteropus)
OK, suggested an edit. (Sean Owen)
This is also a good question for datascience.stackexchange.com, though I'm not sure whether it can be migrated by us. (Sean Owen)

2 Answers

1
votes

Yes, this is normal operation for random forests. At each node, the algorithm considers only a subset of all possible features to split on. If you have two features, then yes, each split will randomly use one or the other.

This randomization is actually quite important when every tree is fit on the same training set: without it, every tree would be identical. That is, you get different trees by considering different subsets of the features or different subsets of the data.
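To see this behavior concretely, here is a minimal sketch using scikit-learn's RandomForestRegressor as a stand-in for R's randomForest (an assumption of mine; the question is about R). Setting max_features=1 is the analogue of mtry = 1: each split considers exactly one randomly chosen predictor, yet the forest can still fit a two-predictor regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # only two predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features=1 means each split considers exactly one randomly chosen
# predictor, which is the scikit-learn analogue of mtry = 1 in R.
rf = RandomForestRegressor(
    n_estimators=500, max_features=1, oob_score=True, random_state=0
)
rf.fit(X, y)

print(rf.oob_score_)            # out-of-bag R^2
print(rf.feature_importances_)  # impurity-based importances
```

Even with mtry/max_features at 1, the out-of-bag score stays high here because every tree still uses both predictors across its many splits; randomization only governs which predictor is considered at each individual split.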

Whether a random forest is a good fit depends on the nature of your data. Do you have categorical features? Then it is a good choice compared to other algorithms. Are the decision boundaries probably not axis-aligned? Then it is not a good choice. While some variants of random forests can compute non-axis-aligned decision boundaries, R's implementation won't.

Try it and see what kind of evaluation metrics you get.

0
votes

I think you're right to be worried.

The way some of these forest algorithms work is by gauging the number of times a variable is chosen to make a split, and at what point in the tree.

If variable selection is completely random, then this stat will be completely meaningless.

If the statistic is calculated using the scrambling (permutation) method, that is, randomizing the values of a feature and measuring the drop in performance, this might not be as much of a problem, but it's hard to tell.
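For comparison, the scrambling method can be sketched with scikit-learn's permutation_importance, which I'm assuming here as a rough analogue of randomForest's %IncMSE (both shuffle a feature and measure the loss in predictive performance):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))  # two predictors, as in the question
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features=1 mirrors mtry = 1 from the question
rf = RandomForestRegressor(n_estimators=300, max_features=1, random_state=0)
rf.fit(X_tr, y_tr)

# Shuffle each feature in turn on held-out data and measure the
# drop in score; a larger drop means a more important feature.
result = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)
print(result.importances_mean)
```

Because the importance here is measured on held-out data after fitting, it reflects how much each feature actually contributes to predictions, rather than how often the (randomized) split selection happened to pick it.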

You might want to use another method to gauge importance, or at least for comparison.