4 votes

I'm trying to run a simple GBM classification model to benchmark performance against random forests and SVMs, but I'm having trouble getting the model to score correctly. It's not throwing an error, but the predictions are all NaN. I'm using the breast cancer data from mlbench. Here's the code:

library(gbm)
library(mlbench)
library(caret)
library(plyr)
library(ada)
library(randomForest)

data(BreastCancer)
bc <- BreastCancer
rm(BreastCancer)

bc$Id <- NULL
bc$Class <- as.factor(mapvalues(bc$Class, c("benign", "malignant"), c("0","1")))

index <- createDataPartition(bc$Class, p = 0.7, list = FALSE)
bc.train <- bc[index, ]
bc.test <- bc[-index, ]

model.gbm <- gbm(Class ~ ., data = bc.train, n.trees = 500)

pred.gbm <- predict(model.gbm, bc.test.ind, n.trees = 500, type = "response")

Can anyone help out with what I'm doing wrong? Also, am I going to have to transform the output of the predict function? I've read that this seems to be an issue with GBM predictions. Thanks.

2
This is a "feature" of the gbm package. See here for an explanation. (basically, gbm assumes that factor responses follow the multinomial distribution. If there are only 2 unique response values (whether character or numeric), then it assumes bernoulli.filups21

2 Answers

6 votes

I have run into problems passing a factor response variable to gbm before. You can force the Class variable to be a character type instead of a factor, and that should do it.

bc$Class <- as.factor(mapvalues(bc$Class, c("benign", "malignant"), c("0","1")))
bc$Class <- as.character(bc$Class)

Your code should run fine from there; just make sure you call bc.test (not bc.test.ind) in predict.
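With that conversion applied before the split, the fit and prediction lines from the question become, roughly:

model.gbm <- gbm(Class ~ ., data = bc.train, n.trees = 500)
pred.gbm <- predict(model.gbm, bc.test, n.trees = 500, type = "response")  # bc.test, not bc.test.ind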

Here's a summary of the predicted values I got after making those changes:

> summary(pred.gbm)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.222   0.222   0.231   0.346   0.573   0.579 
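These values are probabilities on the response scale, so to answer your second question: yes, if you want hard class predictions you still need to threshold them. A minimal sketch, assuming the values are probabilities of the "1" (malignant) class and using the usual 0.5 cutoff (pred.class is just an illustrative name):

pred.class <- ifelse(pred.gbm > 0.5, "1", "0")  # 0.5 cutoff; pred.class is an illustrative name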

One last thing, I would recommend setting a seed (e.g. using set.seed()) before calling createDataPartition(). Otherwise you will get different training and test sets every time you run your code.
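For example (the seed value itself is arbitrary):

set.seed(123)  # any fixed value works; this just makes the partition reproducible
index <- createDataPartition(bc$Class, p = 0.7, list = FALSE)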

0 votes

You can just convert the labels to 0 and 1, but store the original factor labels first for comparison:

library(gbm)
library(mlbench)
library(caret)

data(BreastCancer)
bc <- BreastCancer

bc$Id <- NULL
# store the actual labels for later comparison
labels <- bc$Class
# recode the response as 0/1 so gbm can use the bernoulli distribution
bc$Class <- as.numeric(bc$Class) - 1
index <- createDataPartition(bc$Class, p = 0.7, list = FALSE)
bc.train <- bc[index, ]
bc.test <- bc[-index, ]

model.gbm <- gbm(Class ~ ., data = bc.train, n.trees = 500, distribution = "bernoulli")

pred.gbm <- predict(model.gbm, bc.test, n.trees = 500, type = "response")

Since there are only two classes, we can map the predictions back to the original labels: take the first factor level when p <= 0.5 and the second level when p > 0.5:

predicted_labels <- levels(labels)[1 + (pred.gbm > 0.5)]

We then pull out the actual test labels and build a confusion matrix to check that it worked correctly:

test_labels <- labels[-index]

table(predicted_labels, test_labels)
                test_labels
predicted_labels benign malignant
       benign       129         2
       malignant      3        75
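If you also want accuracy, sensitivity, and the other usual statistics, caret's confusionMatrix() will compute them from the same two vectors; a small sketch, assuming predicted_labels is first converted to a factor with the same levels as test_labels:

confusionMatrix(factor(predicted_labels, levels = levels(test_labels)), test_labels)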