
We are modelling a decision tree using both continuous and binary inputs, analyzing the effect of weather on biking behavior. A linear regression suggests that "rain" has a huge impact on bike counts. Our rain variable is binary, indicating the hourly rain status.

However, a decision tree created with rpart does not include "rain" as a node, although we expect it to be very decisive for the number of bikes. This might be due to the classification of the rain variable: rpart seems to prefer continuous variables (like temperature) for decision nodes.

Is there anything we should know about how rpart decides whether to use continuous or binary variables as decision nodes? Is it possible to control this selection of variables?

library("rpart") fit <- rpart(bikecount ~ df.weather$temp+df.weather$weekday+df.weather$rain, data=training.data, method="class")

1
A continuous variable is always discretized according to an optimal cutoff value. Maybe that's the reason. - Andrej
Actually, we use intervals for the bike count variable; the first interval, for example, covers 0 to 499 observations and is named "interval1"… rpart does seem to discretize temperature according to an optimal cutoff value. But is that a problem? Shouldn't "rain" still be one of the first nodes? - Bukowski

1 Answer


The rpart function implements the CART algorithm of Breiman, Friedman, Olshen, and Stone (1984), which is known to suffer from biased variable selection: given two or more variables that are equally predictive of the outcome, the variable with the largest number of unique values is most likely to be selected for splitting. See, for example, Loh and Shih (1997) and Hothorn, Hornik, and Zeileis (2006).
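
To see this bias in action, here is a minimal simulation sketch (not from the original question; all data are made up): both predictors are pure noise with respect to the outcome, yet the continuous one, which offers many more candidate cutpoints, is picked for the first split far more often than the binary one.

library("rpart")
set.seed(42)
# Record the root-split variable over 200 simulated data sets in which
# neither predictor is related to the outcome:
picks <- replicate(200, {
  d <- data.frame(y      = factor(rbinom(100, 1, 0.5)),  # random class label
                  x_cont = rnorm(100),                    # continuous noise
                  x_bin  = factor(rbinom(100, 1, 0.5)))   # binary noise
  fit <- rpart(y ~ x_cont + x_bin, data = d, method = "class",
               control = rpart.control(cp = 0, minsplit = 20))
  if (nrow(fit$frame) > 1) as.character(fit$frame$var[1]) else NA_character_
})
table(picks)  # x_cont is chosen far more often than x_bin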

Unbiased recursive partitioning methods separate the selection of (1) the splitting variable from the selection of (2) the splitting value, which removes this variable selection bias. Unbiased recursive partitioning is implemented in the R package partykit.

Note that the code you provide above is unlikely to work as intended for rpart: the predictor variables in the formula use $ notation while the response variable does not, even though the data argument has also been specified. With data = training.data, the formula should refer to bare column names. Assuming the relevant columns are in training.data, you should be able to fit an unbiased classification tree as follows:

library("partykit")
ct <- ctree(bikecount ~ df.weather$temp + df.weather$weekday + df.weather$rain, 
            data=training.data)
plot(ct)
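
In ctree, the splitting variable is selected first, using permutation-based tests of association between each predictor and the response; the variable with the smallest (multiplicity-adjusted) p-value is chosen. A binary rain variable therefore competes on equal footing with temperature, regardless of how many distinct values temperature offers.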

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth, Monterey, CA.

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.

Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815-840.