0
votes

I'm trying to plot a J48 decision tree based on string-valued attributes and predict the (categorical) target variable. I have seen many examples that plot a decision tree based on numerical values, but I haven't come across one based on strings.

Here is a sample data set for which the J48 decision tree works fine.

library(RWeka)
library(party)

MyData2 <- read.csv(file="iris.csv", header=TRUE, sep=",")
m3 <- J48(species ~ ., data = MyData2)
if(require("party", quietly = TRUE)) plot(m3)


sepal_length  sepal_width  petal_length  petal_width  species
5.1           3.5          1.4           0.2          setosa
4.9           3            1.4           0.2          setosa
7             3.2          4.7           1.4          versicolor
6.4           3.2          4.5           1.5          versicolor
6.3           3.3          6             2.5          virginica
5.8           2.7          5.1           1.9          virginica

If I rename the headers (sepal_length, sepal_width, and so on) to sepal_color etc. and use values such as "white" and "black", with different combinations of colors for setosa, versicolor and virginica, how do I plot the decision tree and predict the target species value?

Suppose I have a data set like the one below:

 sepal_color    sepal_color petal_color petal_color species
    white         black       white        black    setosa
    white         yellow      white        yellow   versicolor
    green         brown       green        brown    virginica

2 Answers

1
votes

If the string variables represent levels of a categorical variable, then they should be converted to factors with factor() in R. J48() can then deal with them appropriately (just like other model-fitting functions).
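
A minimal sketch of that conversion, using a small made-up data frame in the spirit of the question (the column names and color values are hypothetical, not real measurements). Note that since R 4.0.0, read.csv() keeps strings as character by default, so the conversion has to be done explicitly or requested with stringsAsFactors = TRUE.

```r
## hypothetical toy data with string-valued attributes (illustration only)
colors <- data.frame(
  sepal_color = c("white", "white", "green", "white", "green", "white"),
  petal_color = c("black", "yellow", "brown", "black", "brown", "yellow"),
  species     = c("setosa", "versicolor", "virginica",
                  "setosa", "virginica", "versicolor"),
  stringsAsFactors = FALSE
)

## convert every character column (including the target) to a factor
is_chr <- vapply(colors, is.character, logical(1))
colors[is_chr] <- lapply(colors[is_chr], factor)
str(colors)

## fit the tree only if RWeka (and its Java dependency) is available
if(requireNamespace("RWeka", quietly = TRUE)) {
  m <- RWeka::J48(species ~ ., data = colors)
  print(m)
}
```

The target variable has to be a factor as well, which is why the conversion is applied to all character columns rather than only to the predictors.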

However, if the strings contain free text, then they are not supported directly. Some feature preprocessing that turns them into numeric or factor variables would be necessary before calling J48().

As an example for classification based on categorical variables, let's turn the variables in the iris data into factors with three levels low, medium, high (cutting each variable into three equally-sized groups at the corresponding quantiles):

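## load data and convert to factors via cut()
## (include.lowest = TRUE keeps the minimum of each variable in "low"
## instead of turning it into NA)
data("iris", package = "datasets")
for(i in 1:4) iris[[i]] <- cut(iris[[i]],
  quantile(iris[[i]], 0:3/3),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)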
head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          low        high          low         low  setosa
## 2          low      medium          low         low  setosa
## 3          low      medium          low         low  setosa

## fit and plot J4.8 tree
library("RWeka")
j48 <- J48(Species ~ ., data = iris)
plot(j48)

J4.8 tree
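
The question also asks about predicting the target. A hedged sketch of how that could look with predict(), rebuilding the factor-coded iris data so the block is self-contained; the object names d, newdata and fit are only for illustration. The new observation should use the same factor levels as the training data.

```r
## rebuild the factor-coded iris data (low / medium / high per variable)
data("iris", package = "datasets")
d <- iris
for(i in 1:4) d[[i]] <- cut(d[[i]],
  quantile(d[[i]], 0:3/3),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE  # keep the minimum in the lowest bin
)

## one hypothetical new observation, with levels matching the training data
lev <- c("low", "medium", "high")
newdata <- data.frame(
  Sepal.Length = factor("low",    levels = lev),
  Sepal.Width  = factor("medium", levels = lev),
  Petal.Length = factor("low",    levels = lev),
  Petal.Width  = factor("low",    levels = lev)
)

## fit and predict only if RWeka (and Java) is available
if(requireNamespace("RWeka", quietly = TRUE)) {
  fit <- RWeka::J48(Species ~ ., data = d)
  print(predict(fit, newdata = newdata))                        # predicted class
  print(predict(fit, newdata = newdata, type = "probability"))  # class probabilities
}
```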

0
votes

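Does the algorithm allow string regressors? I tried it and it threw an error. With strings, you could try encoding the values as integers, e.g. "White" = 1, "Black" = 2, etc. (strictly speaking this is an integer coding rather than one-hot encoding). For example: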

library(RWeka)

## add a numeric Colour code: 1 for setosa, 2 for all other species
MyData2 <- iris
MyData2$Colour <- 2
MyData2[MyData2$Species == "setosa", ]$Colour <- 1
m3 <- J48(Species ~ ., data = MyData2)
plot(m3)
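
Strictly speaking, one-hot encoding means one 0/1 indicator column per level, rather than a single column of integer codes. A small base-R sketch of that using model.matrix(); the Colour values here are made up for illustration. For J48() itself, keeping the variable as a factor is usually the simpler route.

```r
## hypothetical colour variable stored as a factor
d <- data.frame(Colour = factor(c("White", "Black", "White", "Green")))

## "- 1" drops the intercept so that every level gets its own indicator column
onehot <- model.matrix(~ Colour - 1, data = d)
onehot
```

Each row of the resulting matrix contains exactly one 1, in the column of that row's colour.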