
I am trying to build a decision tree for a prediction model on the following dataset:

[screenshot of the dataset]

And here is my code:

library(rpart)
library(rattle)  # for fancyRpartPlot()
fitTree <- rpart(classLabel ~ from_station_id + start_day + start_time
                 + gender + age, method = "class", data = d)
fancyRpartPlot(fitTree)

But the resulting decision tree uses only one of the attributes (from_station_id) as the splitting attribute and ignores the values of the other attributes (start_day, start_time, gender, age). Here is the result:


[plot of the resulting tree, split only on from_station_id]

What am I doing wrong?

From your data snippet it appears that, in the case of from_station_id == 131, from_station_id is much more predictive of classLabel == 2 than any of the other variables. And from your output it appears that this is also the case for the other station IDs. So what you are doing wrong is assuming that rpart will use all variables even when they are less predictive. - IRTFM
What about building another model, either by dropping the column from_station_id or by creating a new column from_station_id_grp that bins the station IDs into fewer, less predictive groups (you can use the Hmisc::cut2 function for this, with the g = 5 parameter)? Both options are sketched below. - knb
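A rough sketch of both suggestions, assuming the data frame is d as in the question and that from_station_id is numeric (Hmisc::cut2 with g = 5 bins a numeric variable into five groups of roughly equal size); the model names fitNoStation and fitGrouped are just placeholders:

library(rpart)
library(Hmisc)  # for cut2()

# Option 1: drop from_station_id entirely
fitNoStation <- rpart(classLabel ~ start_day + start_time + gender + age,
                      method = "class", data = d)

# Option 2: replace from_station_id with a coarser, binned version
d$from_station_id_grp <- cut2(d$from_station_id, g = 5)
fitGrouped <- rpart(classLabel ~ from_station_id_grp + start_day + start_time
                    + gender + age, method = "class", data = d)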

1 Answer


Your syntax looks correct. Based on the snippet of your dataset, classLabel and from_station_id may be closely correlated (and maybe gender, too). In that case, from_station_id will be the best predictor of classLabel, and the other variables are either not informative or are also correlated but masked, so they will not show up in the tree. Try:

summary(fitTree)  # detailed splits, surrogate splits, and variable importance

This will show you in more detail how the splits were made, along with the variable importance, which can help you evaluate masking. You should avoid correlated predictors, as they mask one another and can interfere with interactions.
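The importance scores can also be pulled out of the fitted object directly. A minimal sketch, assuming fitTree is the rpart fit from the question; variables that never appear (or barely register) are the ones being masked or ignored:

# named numeric vector of importance scores, largest first
fitTree$variable.importance

# the same scores as percentages, easier to compare at a glance
round(100 * fitTree$variable.importance / sum(fitTree$variable.importance), 1)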

If you are only seeing from_station_id in the summary, then you know the fit is ignoring the other variables, though I am not sure why it would.
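One quick way to check is to list the variables that actually appear in splits. A sketch, assuming fitTree is the fitted model: the frame component of an rpart object records the splitting variable for each node, with "<leaf>" marking terminal nodes.

# variables used in at least one split of the fitted tree
used <- setdiff(unique(as.character(fitTree$frame$var)), "<leaf>")
used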