2
votes


Quick question on R tree models. I want to produce a tree model on a lot of variables (mostly numeric or factor variables). One of the variables is Gender, where the categories are male, female and unknown. When I use the tree or rpart function from the tree and rpart libraries, I only get two branches from the Gender root. The unknown gender has been grouped with the females to form a single category, so the branches I am getting are Female+Unknown and Male. I checked the tree package pdf http://cran.r-project.org/web/packages/tree/tree.pdf and it says that the levels of an unordered factor are divided into two non-empty groups. The rpart function appears to be very similar to the tree function in terms of handling factors with more than 2 levels.

My question is therefore: are there any other functions or packages in R that will let me produce more than 2 branches from a single node, or does anyone have any suggestions for other open source tools that will do the same? Let me know if you need any more information.

1
Did you post an almost exact question to R-Help earlier? If so, what was wrong with Achim's response, which covered what I show below in my answer? - Gavin Simpson
No? When was that? Can you post the link and I'll check? I haven't posted anything on R-help for a month or two now. - Lorcan Treanor
Just went back and checked and it wasn't you. See this thread for details: stat.ethz.ch/pipermail/r-help/2012-September/324279.html - Gavin Simpson
Most decision trees use binary splits. I think you're confusing the treatment of an n-level ('polychotomous') categorical variable. The tree is perfectly well able to handle n levels simply by making further splits at subsequent nodes (as long as doing so improves the impurity measure). In your case, a subsequent node can split the (Female, Unknown) group further. The reason you might not see that split is if it doesn't affect the response variable. - smci
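smci's point can be illustrated with a small sketch on synthetic data (the variable names and data here are made up for illustration, not from the question's data set): when all three levels of a 3-level factor matter for the response, rpart() separates them with two successive binary splits, giving three terminal nodes.

```r
## Illustration: rpart() handles a 3-level factor via successive binary
## splits (synthetic data, for illustration only)
library(rpart)
set.seed(1)
gender <- factor(sample(c("Male", "Female", "Unknown"), 300, replace = TRUE))
## Make the response differ across all three levels so both splits pay off
y <- factor(ifelse(gender == "Male", "A",
            ifelse(gender == "Female", "B", "C")))
## Loosen the stopping rules so nothing blocks the second split
mod <- rpart(y ~ gender, minsplit = 2, cp = 0)
mod  # root splits off one group; a child node then separates the other two
```

Printing the fitted model shows three terminal nodes, one per gender level, even though every individual split is binary.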

1 Answer

2
votes

rpart() is perfectly capable of handling a response with more than 2 categories. Try:

require(rpart)
mod <- rpart(Species ~ ., data = iris)
mod
plot(mod)
text(mod)

Which produces a tree with 3 terminal nodes when run using the default settings:

R> mod
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

    1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
      2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
      3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
        6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
        7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *

The recursive partitioning algorithm stops building the tree when certain stopping rules are met: there is no point splitting a node that is already pure (contains a single class); by default a node must contain at least 20 observations before a split is attempted; any resulting child node must contain at least 7 observations; a split must improve the fit by a factor of at least 0.01 (the complexity parameter, cp); and so on. These can be controlled via the rpart.control() function.
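The default values mentioned above can be checked directly, since rpart.control() returns the full list of stopping parameters:

```r
## Inspect the default stopping rules used by rpart()
library(rpart)
defaults <- rpart.control()
defaults$minsplit   # 20: minimum observations in a node before a split is attempted
defaults$minbucket  # 7:  minimum observations allowed in any terminal node
defaults$cp         # 0.01: minimum improvement in fit required to keep a split
```

Note that minbucket defaults to round(minsplit/3), so changing minsplit alone also changes minbucket unless you set it explicitly.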

From what limited information you have given us, I can only conclude that these defaults are inappropriate for your data set and you should adjust them accordingly, e.g.:

ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = 0.00001)
mod2 <- rpart(Species ~ ., data = iris, control = ctrl)
mod2
plot(mod2)
text(mod2)

Which for this example data set produces a much larger tree:

R> mod2
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
   2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
   3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
     6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259)  
      12) Petal.Length< 4.95 48   1 versicolor (0.00000000 0.97916667 0.02083333)  
        24) Petal.Width< 1.65 47   0 versicolor (0.00000000 1.00000000 0.00000000) *
        25) Petal.Width>=1.65 1   0 virginica (0.00000000 0.00000000 1.00000000) *
      13) Petal.Length>=4.95 6   2 virginica (0.00000000 0.33333333 0.66666667)  
        26) Petal.Width>=1.55 3   1 versicolor (0.00000000 0.66666667 0.33333333)  
          52) Sepal.Length< 6.95 2   0 versicolor (0.00000000 1.00000000 0.00000000) *
          53) Sepal.Length>=6.95 1   0 virginica (0.00000000 0.00000000 1.00000000) *
        27) Petal.Width< 1.55 3   0 virginica (0.00000000 0.00000000 1.00000000) *
     7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087)  
      14) Petal.Length< 4.85 3   1 virginica (0.00000000 0.33333333 0.66666667)  
        28) Sepal.Length< 5.95 1   0 versicolor (0.00000000 1.00000000 0.00000000) *
        29) Sepal.Length>=5.95 2   0 virginica (0.00000000 0.00000000 1.00000000) *
      15) Petal.Length>=4.85 43   0 virginica (0.00000000 0.00000000 1.00000000) *

but is most likely highly over-fitted to the data.

That said, there are, of course, other packages that can fit trees and, like rpart(), handle a response with more than two levels. The main ones are listed in the Machine Learning & Statistical Learning Task View on CRAN, which you should consult. One such package is party.
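As a quick sketch of the party alternative (assuming the party package is installed from CRAN; it is not part of base R), its ctree() function fits a conditional-inference tree and also handles multi-class responses such as iris$Species:

```r
## Conditional-inference tree via party::ctree() on the same example data;
## like rpart(), ctree() uses binary splits but handles a 3-class response
library(party)
mod3 <- ctree(Species ~ ., data = iris)
mod3        # print the fitted tree
plot(mod3)  # party's plot method draws the tree with node distributions
```

Note that ctree() still makes binary splits at each node; the difference lies in how it selects splits (permutation tests rather than impurity reduction), and it has its own stopping criteria controlled via ctree_control().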