XGBoost Model performance

Question

I am trying to use XGBoost for classification. I am pretty doubtful on its accuracy. I have applied it with default parameters and the precision is 100%.

xg_cl_default = xgb.XGBClassifier()
xg_cl_default.fit(trainX, trainY)
preds = xg_cl_default.predict(testX)

precision_score(testY,preds)
# 1.0

However my data is imbalance so I use scale_pos_weight parameter along with few other parameters given below:

ratio = int(df_final.filter(col('isFraud')==0).count()/df_final.filter(col('isFraud')==1).count())    

xg_cl = xgb.XGBClassifier(scale_pos_weight = ratio, n_estimators=50)

eval_set = [(valX, valY.values.ravel())]
xg_cl.fit(trainX, trainY.values.ravel(),eval_metric="error", early_stopping_rounds=10,eval_set=eval_set, verbose=True)

preds = xg_cl_default.predict(testX)

precision_score(testY,preds)
# 1.0

In both the cases my precision is 100% and Recall is 99%. This is not acceptable for me as data is highly imbalance.

lightalchemist lightalchemist · Accepted Answer · 2020-06-18T10:41:18

For imbalanced datasets, a more appropriate evaluation metric is the area under the precision-recall curve, so set eval_metric="aucpr" instead.

Also, you should tune the parameters of XGBoost using cross-validation, again with the area under the precision-recall curve as the evaluation metric. Cross-validation can be done in a variety of ways and a quick search should get you a number of code examples. Based on the code you shared, unless your problem is trivial, it is unlikely that you can get a meaningful model without careful tuning of the parameters.

Lastly, you can plot the confusion matrix using Scikit-Learn from the true labels and the predicted labels to get a sense of whether the model is making meaningful predictions.

XGBoost Model performance

2 Answers