I am an actuarial student preparing for an upcoming predictive analytics exam in December. Part of an exercise is to build a model using boosting with caret and xgbTree. See the code below; the Caravan dataset is from the ISLR package:
library(caret)
library(ggplot2)
set.seed(1000)
data.Caravan <- read.csv(file = "Caravan.csv")
data.Caravan$Purchase <- factor(data.Caravan$Purchase)
levels(data.Caravan$Purchase) <- c("No", "Yes")
data.Caravan.train <- data.Caravan[1:1000, ]
data.Caravan.test <- data.Caravan[1001:nrow(data.Caravan), ]
grid <- expand.grid(max_depth = c(1:7),
                    nrounds = 500,
                    eta = c(.01, .05, .01),
                    colsample_bytree = c(.5, .8),
                    gamma = 0,
                    min_child_weight = 1,
                    subsample = .6)
control <- trainControl(method = "cv",
                        number = 4,
                        classProbs = TRUE,
                        sampling = c("up", "down"))
caravan.boost <- train(formula = Purchase ~ .,
                       data = data.Caravan.train,
                       method = "xgbTree",
                       metric = "Accuracy",
                       trControl = control,
                       tuneGrid = grid)
The definitions in expand.grid and trainControl were specified by the problem, but I keep getting an error:
Error: sampling methods are only implemented for classification problems
If I remove the sampling method from trainControl, I get a new error that states "Metric Accuracy not applicable for regression models". If I remove the Accuracy metric, I get an error stating "cannot compute class probabilities for regression" along with "Error in names(res$trainingData) %in% as.character(form[[2]]) : argument "form" is missing, with no default".
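For completeness, these are roughly the variants I tried when removing pieces. I'm reconstructing them from memory (the object names control.nosamp, caravan.boost.v1, and caravan.boost.v2 are just placeholders), so the exact values may differ slightly from my actual script:

# Variant 1: drop sampling from trainControl; metric = "Accuracy" kept in train()
# -> "Metric Accuracy not applicable for regression models"
control.nosamp <- trainControl(method = "cv",
                               number = 4,
                               classProbs = TRUE)
caravan.boost.v1 <- train(formula = Purchase ~ .,
                          data = data.Caravan.train,
                          method = "xgbTree",
                          metric = "Accuracy",
                          trControl = control.nosamp,
                          tuneGrid = grid)

# Variant 2: additionally drop metric = "Accuracy"
# -> "cannot compute class probabilities for regression" and
#    'argument "form" is missing, with no default'
caravan.boost.v2 <- train(formula = Purchase ~ .,
                          data = data.Caravan.train,
                          method = "xgbTree",
                          trControl = control.nosamp,
                          tuneGrid = grid)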
Ultimately, the problem is that caret is treating this as a regression problem rather than classification, even though the target variable is set as a factor and classProbs is set to TRUE. Can someone explain how to tell caret to run classification rather than regression?
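For what it's worth, inspecting the outcome directly does show a two-level factor. This is just a quick diagnostic sketch, not part of the graded exercise:

# Diagnostics: confirm the outcome column really is a two-level factor
str(data.Caravan.train$Purchase)      # Factor w/ 2 levels "No","Yes"
levels(data.Caravan.train$Purchase)   # "No" "Yes"
table(data.Caravan.train$Purchase)    # class counts in the 1,000-row training split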
Could you add the output of dput(head(data.Caravan, 20)) to your question? This will give us the first 20 records of your source data. That way we can run your code with your data. For more info, read this post on reproducible examples. – phiver