2
votes

I am using XGBoost in Python to perform text classification.

Below is the training set I am considering:

itemid       description                                            category
11802974     SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters    Architectural Diffusers
10688548     ANTIQUE BRONZE FINISH PUSHBUTTON  switch           Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs

I am constructing a document-term matrix from the description column using scikit-learn's CountVectorizer, which generates a SciPy sparse matrix (I have about 1.1 million rows, so I am using a sparse representation to reduce memory usage), with the code below:

countvec = CountVectorizer()
documenttermmatrix=countvec.fit_transform(trainset['description'])

After that, I apply feature selection to the above matrix using:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train = fs.fit_transform(documenttermmatrix, y_train)

I am using the XGBoost classifier to train the model:

model = XGBClassifier(silent=False)

model.fit(documenttermmatrix_train, y_train,verbose=True)

Below is the test set I am considering:

itemid      description                       category
9836442     TRIPLE Space heaters              Architectural Diffusers
13863918    pushbutton switch                  Door Bell Pushbuttons

I am constructing a separate matrix for the test set, as I did for the training set, using the code below:

 documenttermmatrix_test=countvec.fit_transform(testset['description'])

When predicting on the test set, XGBoost expects all the features of the training set to be present in the test set, but that is not the case here (the sparse matrix stores only non-zero entries).

I cannot combine the training and test sets into a single dataset, as I need to do feature selection on the training set only.

Can anyone tell me how I can proceed?


3 Answers

3
votes

Instead of calling countvec.fit_transform() on the test set, call only transform().

Change this line:

documenttermmatrix_test=countvec.fit_transform(testset['description'])

To this:

documenttermmatrix_test=countvec.transform(testset['description'])

This ensures that only the features present in the training set are taken from the test set; any of those features that do not occur in a test document are set to 0.

fit_transform() discards the previously learned vocabulary and builds a new matrix, which can have different features from the earlier output. Hence the error.
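A minimal sketch of the whole preprocessing path, using the sample rows from the question. The variable names (train_docs, X_train, and so on) are made up for illustration. It also assumes the same rule applies to the feature selector: fit it on the training matrix only, then call fs.transform() on the test matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Sample rows copied from the question; names below are illustrative only.
train_docs = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON switch",
    "Descente pour Cable tray fitting and accessories",
]
train_labels = [
    "Architectural Diffusers",
    "Door Bell Pushbuttons",
    "Tray Cable Drop Outs",
]
test_docs = ["TRIPLE Space heaters", "pushbutton switch"]

# Learn the vocabulary from the training set only.
countvec = CountVectorizer()
X_train_counts = countvec.fit_transform(train_docs)

# Reuse the learned vocabulary: unseen test words are dropped,
# training words missing from a test document become zeros.
X_test_counts = countvec.transform(test_docs)

# Fit the selector on the training data, then apply the same
# column selection to the test data.
fs = SelectPercentile(chi2, percentile=40)
X_train = fs.fit_transform(X_train_counts, train_labels)
X_test = fs.transform(X_test_counts)

# Both matrices now have the same number of columns.
print(X_train.shape, X_test.shape)
```

The model would then be fitted on X_train and asked to predict on X_test, which has exactly the columns it was trained on.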

2
votes

You have to use fit_transform on the training set, but only transform on your test set. Also, the default output of CountVectorizer is a CSR matrix. It doesn't work with XGBClassifier; you have to convert it to a CSC matrix. Simply do: X = csc_matrix(X).
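The conversion itself can be sketched as below. The small matrix is made-up data standing in for the vectorizer output, and whether the conversion is actually required may depend on the XGBoost version in use, so treat it as something to try rather than a guaranteed fix.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

# Stand-in for the CSR matrix that CountVectorizer returns (made-up values).
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 0, 3]]))

# Convert to compressed sparse column format; X.tocsc() is equivalent.
X_csc = csc_matrix(X)

# The storage format changes, the values and shape do not.
print(X.format, X_csc.format, X_csc.shape)
```

The same conversion would have to be applied to both the training and the test matrix.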

0
votes

There is no easy way around this issue, common as it is. XGBoost and other tree-based models can handle test sets with more variables than the training set (since they can ignore them), but never fewer (since they expect to make decisions on them). That being the case, you have some options, in descending order of desirability / likelihood to solve your problem:

  1. Don't use a sparse matrix. Unless you're building this model inside a real-time application or otherwise prohibitive production environment, the easiest thing to do is use an ordinary matrix that will keep columns of zeros.

  2. Look at how you're partitioning your data. It may be that there are only one or two factors with an unbalanced split, in which case you might be able to get more equal representation by playing around with scikit-learn's train_test_split() functionality.

  3. Prune the data yourself. Similar to option 2, if you think a couple entries are the culprits, and that their removal wouldn't hurt your model, you can try removing them from the original dataset. This is, of course, the least desirable option, but if they really are that few and far between, they won't affect the predictive power of your model.
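As one possible reading of option 2, train_test_split() accepts a stratify argument that keeps class proportions the same on both sides of the split. The sketch below uses made-up descriptions and categories, and the 50/50 split size is an arbitrary choice for the example.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 12 items in three unevenly sized categories.
descriptions = [f"item {i}" for i in range(12)]
categories = ["A"] * 6 + ["B"] * 4 + ["C"] * 2

# stratify=categories asks for the same class proportions in both halves.
train_docs, test_docs, train_cats, test_cats = train_test_split(
    descriptions,
    categories,
    test_size=0.5,
    stratify=categories,
    random_state=0,
)

# Every category appears in both the training and the test split.
print(sorted(set(train_cats)), sorted(set(test_cats)))
```

Stratification needs at least two samples per category, so very rare categories would still have to be handled separately.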

But broadly this is a sign of an unhealthy dataset. I would also advise looking at other ways you might bin or categorize your data into fewer groups so that this isn't a problem.