
I am trying to implement xgboost on a classification data with imbalanced classes (1% of ones and 99% zeroes).

I am using binary:logistic as the objective function for classification.

According to my understanding of xgboost, as boosting builds trees, the objective function is optimized iteratively, achieving the best performance at the end when all the trees are combined.

In my data, due to the class imbalance, I am running into the accuracy paradox: the final model achieves great accuracy but poor precision and recall.

I want a custom objective function that optimizes the model and returns a final xgboost model with the best F-score. Alternatively, is there any other objective function that yields the best F-score?

where F-Score = (2 * Precision * Recall) / (Precision + Recall).


1 Answer


I'm no expert on the matter, but I think a custom evaluation metric (rather than a custom objective) should do the job:

f1score_eval <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")

  # Threshold the predicted probabilities at 0.5 to get hard class assignments
  e_TP <- sum( (labels == 1) & (preds >= 0.5) )  # true positives
  e_FP <- sum( (labels == 0) & (preds >= 0.5) )  # false positives
  e_FN <- sum( (labels == 1) & (preds < 0.5) )   # false negatives
  e_TN <- sum( (labels == 0) & (preds < 0.5) )   # true negatives (not needed for F1)

  e_precision <- e_TP / (e_TP + e_FP)
  e_recall    <- e_TP / (e_TP + e_FN)

  # F1 is the harmonic mean of precision and recall
  e_f1 <- 2 * (e_precision * e_recall) / (e_precision + e_recall)

  return(list(metric = "f1-score", value = e_f1))
}
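
For reference, here is a minimal sketch of how such a metric can be plugged into xgb.train through the feval argument, with maximize = TRUE so that early stopping keeps the round with the highest f1-score. The data objects (X, y, X_valid, y_valid) and the parameter values are placeholders, not taken from the original question:

library(xgboost)

# Placeholder data: X / X_valid are numeric feature matrices, y / y_valid are 0/1 label vectors
dtrain <- xgb.DMatrix(data = X, label = y)
dvalid <- xgb.DMatrix(data = X_valid, label = y_valid)

params <- list(objective = "binary:logistic", eta = 0.1, max_depth = 6)

bst <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 200,
  watchlist = list(valid = dvalid),
  feval = f1score_eval,          # custom evaluation metric defined above
  maximize = TRUE,               # higher f1-score is better
  early_stopping_rounds = 20
)

Note that this only changes the metric used for monitoring and early stopping; the objective the trees actually optimize is still binary:logistic.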

References:

https://github.com/dmlc/xgboost/issues/1152

http://xgboost.readthedocs.io/en/latest/parameter.html