Here is one approach you could try with your data.
Load the data:
forkfold <- read.csv("forkfold.csv", row.names = 1)
The problem here is that the outcome variable is 0 in 97% of the cases, while in the remaining 3% it is very close to zero.
length(forkfold$Vote_perc)
#output
7069
sum(forkfold$Vote_perc != 0)
#output
212
You described it as a classification problem, and I will treat it as such by converting the outcome to a binary variable:
forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
                             "one",
                             "zero")
Since the data set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity, as described here, by defining a custom evaluation function:
fourStats <- function(data, lev = levels(data$obs), model = NULL) {
  # the usual two-class summary: ROC, Sensitivity, Specificity
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  # distance of the current (Spec, Sens) point from the perfect model at (1, 1)
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
                   ncol = 2,
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
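For intuition, Dist is just the Euclidean distance of the current (Spec, Sens) point from the perfect corner (1, 1), so smaller is better. A quick sketch with made-up values:

# illustrative values only: Spec = 0.94, Sens = 0.92
sqrt((1 - 0.94)^2 + (1 - 0.92)^2)
# [1] 0.1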
I will specify this function in trainControl:
train_control <- trainControl(method = "cv",
                              search = "random",
                              number = 5,
                              verboseIter = TRUE,
                              classProbs = TRUE,
                              savePredictions = "final",
                              summaryFunction = fourStats)
set.seed(1)
xgb.mod <- train(Vote_perc ~ .,
                 data = forkfold,
                 trControl = train_control,
                 method = "xgbTree",
                 tuneLength = 50,
                 metric = "Dist",
                 maximize = FALSE,
                 scale_pos_weight = sum(forkfold$Vote_perc == "zero") /
                                    sum(forkfold$Vote_perc == "one"))
I will use the Dist metric defined above in the fourStats summary function. Since this metric should be minimized, I set maximize = FALSE. I use a random search over the tuning space, and 50 random sets of hyperparameter values will be tested (tuneLength = 50).
I also set the scale_pos_weight parameter of the xgboost function. From the help page (?xgboost):
scale_pos_weight, [default=1] Control the balance of positive and
negative weights, useful for unbalanced classes. A typical value to
consider: sum(negative cases) / sum(positive cases) See Parameters
Tuning for more discussion. Also see Higgs Kaggle competition demo for
examples: R, py1, py2, py3
I defined it as recommended: sum(negative cases) / sum(positive cases). After the model trains, it will pick the hyperparameter set that minimizes Dist.
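For this data set the recommended ratio works out to roughly 32, and once training finishes the winning hyperparameter combination can be inspected via bestTune (a quick sketch; bestTune is a standard component of a caret train object):

# negative ("zero") to positive ("one") ratio used for scale_pos_weight
sum(forkfold$Vote_perc == "zero") / sum(forkfold$Vote_perc == "one")
#output
# ~32.3

# hyperparameter combination that minimized the out-of-fold Dist
xgb.mod$bestTune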
To evaluate the confusion matrix on the held-out (out-of-fold) predictions:
caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   195  430
      zero   17 6427

               Accuracy : 0.9368
                 95% CI : (0.9308, 0.9423)
    No Information Rate : 0.97
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4409
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.91981
            Specificity : 0.93729
         Pos Pred Value : 0.31200
         Neg Pred Value : 0.99736
             Prevalence : 0.02999
         Detection Rate : 0.02759
   Detection Prevalence : 0.08841
      Balanced Accuracy : 0.92855

       'Positive' Class : one
I'd say it's not that bad.
You can do better if you tune the prediction cutoff threshold; how to do this during the tuning process is described here. You can also use the out-of-fold predictions to tune the cutoff threshold afterwards. Here I will show how to use the pROC library for it:
library(pROC)
plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
     print.thres = TRUE)
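If you prefer not to read the threshold off the plot, it can also be extracted programmatically (a sketch; coords() is part of pROC, and best.method = "youden" picks the cutoff maximizing Sens + Spec):

roc_obj <- roc(xgb.mod$pred$obs, xgb.mod$pred$one)
# threshold maximizing Sensitivity + Specificity (Youden's J)
coords(roc_obj, x = "best", ret = "threshold", best.method = "youden")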
The threshold shown on the plot maximizes Sens + Spec. To evaluate the out-of-fold performance using this threshold:
caret::confusionMatrix(factor(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                              levels = levels(xgb.mod$pred$obs)),  # confusionMatrix expects factors
                       xgb.mod$pred$obs)
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   200  596
      zero   12 6261

               Accuracy : 0.914
                 95% CI : (0.9072, 0.9204)
    No Information Rate : 0.97
    P-Value [Acc > NIR] : 1

                  Kappa : 0.3668
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.94340
            Specificity : 0.91308
         Pos Pred Value : 0.25126
         Neg Pred Value : 0.99809
             Prevalence : 0.02999
         Detection Rate : 0.02829
   Detection Prevalence : 0.11260
      Balanced Accuracy : 0.92824

       'Positive' Class : one
So out of the 212 non-zero entries, you detected 200.
To perform better you may try to pre-process the data, or use a better hyperparameter search routine such as the mlrMBO package (intended for use with mlr), or perhaps change the learner (though I doubt you can top xgboost here).
One more note: if it is not paramount to get a high Sensitivity, using "Kappa" as the selection metric might provide a more satisfying model.
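For example, a minimal sketch of selecting by Kappa instead of the custom Dist metric (here I assume caret's built-in defaultSummary, which reports Accuracy and Kappa for classification):

train_control_kappa <- trainControl(method = "cv",
                                    number = 5,
                                    savePredictions = "final",
                                    summaryFunction = defaultSummary)

set.seed(1)
xgb.kappa <- train(Vote_perc ~ .,
                   data = forkfold,
                   trControl = train_control_kappa,
                   method = "xgbTree",
                   tuneLength = 50,
                   metric = "Kappa")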
As a final check, let's look at the performance of the model with the default scale_pos_weight = 1, using the already selected hyperparameters:
set.seed(1)
xgb.mod2 <- train(Vote_perc ~ .,
                  data = forkfold,
                  trControl = train_control,
                  method = "xgbTree",
                  tuneGrid = data.frame(nrounds = 498,
                                        max_depth = 3,
                                        eta = 0.008833468,
                                        gamma = 4.131242,
                                        colsample_bytree = 0.4233169,
                                        min_child_weight = 3,
                                        subsample = 0.6212512),
                  metric = "Dist",
                  maximize = FALSE,
                  scale_pos_weight = 1)
caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one    94   21
      zero  118 6836

               Accuracy : 0.9803
                 95% CI : (0.9768, 0.9834)
    No Information Rate : 0.97
    P-Value [Acc > NIR] : 3.870e-08

                  Kappa : 0.5658
 Mcnemar's Test P-Value : 3.868e-16

            Sensitivity : 0.44340
            Specificity : 0.99694
         Pos Pred Value : 0.81739
         Neg Pred Value : 0.98303
             Prevalence : 0.02999
         Detection Rate : 0.01330
   Detection Prevalence : 0.01627
      Balanced Accuracy : 0.72017

       'Positive' Class : one
This is much worse at the default threshold of 0.5. Now the optimal threshold value:
plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
     print.thres = TRUE)
The optimal threshold is now 0.037, compared to the 0.369 obtained when we set scale_pos_weight as recommended. However, with the optimal thresholds both approaches yield identical predictions.
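This can be checked directly on the out-of-fold predictions, matching rows by rowIndex (a sketch; 0.369 and 0.037 are the two thresholds read off the ROC plots):

# align the two sets of out-of-fold probabilities by observation index
m <- merge(xgb.mod$pred[,  c("rowIndex", "one")],
           xgb.mod2$pred[, c("rowIndex", "one")],
           by = "rowIndex", suffixes = c(".weighted", ".default"))

pred_weighted <- ifelse(m$one.weighted > 0.369, "one", "zero")
pred_default  <- ifelse(m$one.default  > 0.037, "one", "zero")

# proportion of observations where the two approaches agree
mean(pred_weighted == pred_default)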