Is it possible to use XGBoost for multi-label classification? Currently I use OneVsRestClassifier over GradientBoostingClassifier from sklearn. It works, but it uses only one core of my CPU. My data has ~45 features, and the task is to predict about 20 columns of binary (boolean) data. The metric is mean average precision (map@7). If you have a short code example to share, that would be great.
3 Answers
One possible approach, instead of using OneVsRestClassifier (which is aimed primarily at multi-class tasks, although it also accepts a multilabel indicator matrix), is to use MultiOutputClassifier from the sklearn.multioutput module.
Below is a small reproducible example with the number of input features and target outputs given by the OP:
import xgboost as xgb
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score
# create sample dataset
X, y = make_multilabel_classification(n_samples=3000, n_features=45, n_classes=20, n_labels=1,
                                      allow_unlabeled=False, random_state=42)
# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# create XGBoost instance with default hyper-parameters
xgb_estimator = xgb.XGBClassifier(objective='binary:logistic')
# create MultiOutputClassifier instance with XGBoost model inside
multilabel_model = MultiOutputClassifier(xgb_estimator)
# fit the model
multilabel_model.fit(X_train, y_train)
# evaluate on test data
print('Accuracy on test data: {:.1f}%'.format(accuracy_score(y_test, multilabel_model.predict(X_test))*100))
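The question names MAP@7 as the metric, whereas the snippet above reports subset accuracy. Below is a minimal sketch of MAP@k over a matrix of per-label scores. The function names are hypothetical, and the normalization (dividing by min(k, number of true labels), with samples that have no true label scoring 0) is an assumption; other variants of the metric exist.

```python
def average_precision_at_k(true_row, score_row, k=7):
    """AP@k for one sample: true_row is a 0/1 sequence, score_row has one score per label."""
    relevant = {j for j, t in enumerate(true_row) if t}
    if not relevant:
        return 0.0  # assumption: samples without any true label score 0
    # label indices ordered by descending score, truncated to the top k
    ranked = sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)[:k]
    hits, total = 0, 0.0
    for rank, j in enumerate(ranked, start=1):
        if j in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(k, len(relevant))


def mean_average_precision_at_k(y_true, y_score, k=7):
    """Mean of AP@k over all samples (rows)."""
    return sum(average_precision_at_k(t, s, k) for t, s in zip(y_true, y_score)) / len(y_true)


# toy check with 2 samples and 3 labels
y_true = [[1, 0, 1], [0, 1, 0]]
y_score = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]
print(mean_average_precision_at_k(y_true, y_score, k=2))
```

With the model above, one way to build the score matrix is to take column 1 (the positive-class probability) from each array in the list returned by multilabel_model.predict_proba(X_test) and stack those columns side by side.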
There are a couple of ways to do that, one of which is the one you already suggested:
1.
from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
# If you want to avoid the OneVsRestClassifier magic switch
# from sklearn.multioutput import MultiOutputClassifier
# params is a dict of XGBClassifier keyword arguments (for example, {'n_jobs': 4})
clf_multilabel = OneVsRestClassifier(XGBClassifier(**params))
clf_multilabel will fit one binary classifier per class, and it will use as many cores as you specify in params, for example via XGBClassifier's n_jobs. (FYI, you can also specify n_jobs in OneVsRestClassifier, but that eats up more memory.)
2.
If you first massage your data a little by making k copies of every data point that has k correct labels (one copy per label), you can hack your way to a simpler multiclass problem. At that point, just run
clf = XGBClassifier(**params)
# train_data holds the duplicated rows; train_labels holds one class index per row
clf.fit(train_data, train_labels)
pred_proba = clf.predict_proba(test_data)
to get classification margins/probabilities for each class and decide what threshold you want for predicting a label.
Note that this solution is not exact: if a product has tags (1, 2, 3), you artificially introduce two negative samples for each class.
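The duplication step of the second approach can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the labels are given as a 0/1 indicator matrix, and the helper name expand_multilabel is hypothetical. Note that rows without any positive label disappear from the expanded data.

```python
import numpy as np


def expand_multilabel(X, Y):
    """Duplicate each row of X once per positive label in the 0/1 matrix Y."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    rows, labels = np.nonzero(Y)  # one (row, label) pair per positive entry
    return X[rows], labels


X = np.array([[1, 2], [3, 4], [5, 6]])
Y = np.array([[1, 0, 1],   # sample 0 has labels 0 and 2 -> two copies
              [0, 0, 0],   # sample 1 has no labels -> dropped
              [0, 1, 0]])  # sample 2 has label 1 -> one copy
X_expanded, y_expanded = expand_multilabel(X, Y)
print(X_expanded)
print(y_expanded)
```

The expanded arrays can then be handed to a multiclass classifier's fit method as features and class indices.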
You can add an input column that identifies which output you want to predict. For example, if this is your data:
X1  X2  X3  X4  Y1  Y2  Y3
1   3   4   6   7   8   9
2   5   5   5   5   3   2
You can simply reshape your data by adding a label to the input, according to the output, and xgboost should learn how to treat it accordingly, like so:
X1  X2  X3  X4  X_label  Y
1   3   4   6   1        7
2   5   5   5   1        5
1   3   4   6   2        8
2   5   5   5   2        3
1   3   4   6   3        9
2   5   5   5   3        2
This way you will have a 1-dimensional Y, but you can still predict many labels.
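As a rough sketch, the reshaping shown in the tables above could be done with NumPy like this. The helper name to_long_format is hypothetical, and the row order (all rows for label 1, then label 2, and so on) is chosen to match the example table.

```python
import numpy as np


def to_long_format(X, Y):
    """Stack X once per output column of Y and append a 1-based X_label column."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    n_samples, n_outputs = Y.shape
    X_repeated = np.tile(X, (n_outputs, 1))                     # X stacked n_outputs times
    x_label = np.repeat(np.arange(1, n_outputs + 1), n_samples)  # 1,1,...,2,2,...
    X_long = np.column_stack([X_repeated, x_label])
    y_long = Y.T.ravel()                                         # Y1 values, then Y2, then Y3
    return X_long, y_long


X = np.array([[1, 3, 4, 6],
              [2, 5, 5, 5]])
Y = np.array([[7, 8, 9],
              [5, 3, 2]])
X_long, y_long = to_long_format(X, Y)
print(X_long)
print(y_long)
```

At prediction time, the same transformation would be applied to the inputs, producing one row (and one prediction) per sample and label.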