1
votes

I'm using random forest to estimate the importance (%IncMSE) of a number of predictors. Afterwards, I use a combination of all predictors but one and calculate their importance again. Random forest works in theory even with only two predictors, but I'm wondering if it makes sense when the mtry parameter is set to 1. If I'm correct, this would mean that at each split random forest fits the tree with only one of the two predictors. If I had to build a model with only two predictors, random forest would not have been my first choice, but I don't know if it is sensible to use two different statistical methods for a similar task in the same paper. Do you think I should use random forest with two predictors too, or would another model be more appropriate? Thanks

2
What do you mean that it works with only two predictors? It has no such inherent limitation, and the R implementation you seem to refer to doesn't have this limitation either. (Sean Owen)
Yes, I know it works. My question wasn't about feasibility but about the reliability of the output. As far as I know, random forest is generally used with many predictor variables; I've never seen it used with only two. Also, since it works by sampling a subset of variables at each split, I was wondering whether the result is still valid when it can only sample one of the two variables each time. (Oritteropus)
OK, suggested an edit. (Sean Owen)
This is also a good question for datascience.stackexchange.com, though I'm not sure whether it can be migrated by us. (Sean Owen)

2 Answers

1
votes

Yes, this is normal operation for random forests. At each node, the algorithm considers only a subset of all possible features to split on. If you have two features, then yes, each split will randomly use one or the other.

This randomization is actually quite important when every tree is fit on the same training set: without it, every tree would be identical. That is, you get different trees by considering different subsets of the features or different subsets of the data.
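To see this behavior concretely, here is a minimal sketch using scikit-learn's RandomForestRegressor as a stand-in for R's randomForest (an assumption of mine; the question is about R). Setting max_features=1 is the analogue of mtry = 1: each split considers exactly one randomly chosen predictor, yet the forest can still fit a two-predictor regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # only two predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features=1 means each split considers exactly one randomly chosen
# predictor, which is the scikit-learn analogue of mtry = 1 in R.
rf = RandomForestRegressor(
    n_estimators=500, max_features=1, oob_score=True, random_state=0
)
rf.fit(X, y)

print(rf.oob_score_)            # out-of-bag R^2
print(rf.feature_importances_)  # impurity-based importances
```

Even with mtry/max_features at 1, the out-of-bag score stays high here because every tree still uses both predictors across its many splits; randomization only governs which predictor is considered at each individual split.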

Whether a random forest is a good fit depends on the nature of your data. Do you have categorical features? Then it is a good choice compared to other algorithms. Are the decision boundaries probably not axis-aligned? Then it is not a good choice. While some variants of random forests can compute non-axis-aligned decision boundaries, R's implementation won't.

Try it and see what kind of evaluation metrics you get.

0
votes

I think you're right to be worried.

The way some of these forest algorithms work is by gauging the number of times a variable is chosen to make a split, and at what point in the tree.

If variable selection is completely random, then this stat will be completely meaningless.

If the statistic is calculated using the scrambling (permutation) method, that is, randomizing the values of a feature and measuring the drop in performance, this might not be as much of a problem, but it's hard to tell.
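For comparison, the scrambling method can be sketched with scikit-learn's permutation_importance, which I'm assuming here as a rough analogue of randomForest's %IncMSE (both shuffle a feature and measure the loss in predictive performance):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))  # two predictors, as in the question
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features=1 mirrors mtry = 1 from the question
rf = RandomForestRegressor(n_estimators=300, max_features=1, random_state=0)
rf.fit(X_tr, y_tr)

# Shuffle each feature in turn on held-out data and measure the
# drop in score; a larger drop means a more important feature.
result = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=0)
print(result.importances_mean)
```

Because the importance here is measured on held-out data after fitting, it reflects how much each feature actually contributes to predictions, rather than how often the (randomized) split selection happened to pick it.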

You might want to use another method to gauge importance, or at least for comparison.