I am trying to use xgboost from Python on a classification problem. I have the data in a numpy matrix X (rows = observations, columns = features) and the labels in a numpy array y. Because my data are sparse, I would like to run it on a sparse version of X, but I seem to be missing something, as an error occurs.
Here is what I do:
# Library import
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from scipy.sparse import csr_matrix
# Converting to sparse data and running xgboost
X_csr = csr_matrix(X)
xgb1 = XGBClassifier()
xgtrain = xgb.DMatrix(X_csr, label=y)  # to work with the xgb format
xgtest = xgb.DMatrix(Xtest_csr)
xgb1.fit(xgtrain, y, eval_metric='auc')
dtrain_predictions = xgb1.predict(xgtest)
etc...
I get an error when trying to fit the classifier:
File ".../xgboost/python-package/xgboost/sklearn.py", line 432, in fit
self._features_count = X.shape[1]
AttributeError: 'DMatrix' object has no attribute 'shape'
I spent a while looking for where this could come from, and I believe it has to do with the sparse format I wish to use. But I have no clue what the cause is or how to fix it.
I would welcome any help or comments! Thank you very much.
X? What does xgb say about using sparse matrix? They often aren't drop in replacements. – hpaulj