error while predicting test data in xgboost python

Question

I am using the xgboost Python package to perform text classification.

Below is the trainset I am considering

itemid       description                                            category
11802974     SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters    Architectural Diffusers
10688548     ANTIQUE BRONZE FINISH PUSHBUTTON  switch           Door Bell Pushbuttons
9836436     Descente pour Cable tray fitting and accessories    Tray Cable Drop Outs

I am constructing a document-term matrix from the description column using scikit-learn's CountVectorizer, which generates a SciPy sparse matrix (as I have a large dataset of 1.1 million rows, I am using a sparse representation to reduce space complexity), using the code below:

countvec = CountVectorizer()
documenttermmatrix=countvec.fit_transform(trainset['description'])

After that I will apply feature selection for the above matrix using

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train = fs.fit_transform(documenttermmatrix, y_train)

I am using xgboost classifier to train the model

model = XGBClassifier(silent=False)

model.fit(documenttermmatrix_train, y_train,verbose=True)

Below is the testset I am considering

itemid      description                       category
9836442     TRIPLE Space heaters              Architectural Diffusers
13863918    pushbutton switch                  Door Bell Pushbuttons

I am constructing a separate matrix for the test set, as I did for the train set, using the code below:

 documenttermmatrix_test=countvec.fit_transform(testset['description'])

While predicting on the test set, XGBoost expects all the features of the train set to be present in the test set, but that is not possible (the sparse matrix represents only non-zero entries).

I cannot combine the train and test sets into a single dataset, as I need to do feature selection only on the train set.

Can anyone tell me how to proceed?

Vivek Kumar · Accepted Answer · 2017-11-23T05:30:17

Instead of using countvec.fit_transform() on the test set, use only transform().

Change this line:

documenttermmatrix_test=countvec.fit_transform(testset['description'])

To this:

documenttermmatrix_test=countvec.transform(testset['description'])

This ensures that only the features learned from the training set are extracted from the test set; any of those features that does not occur in a test document gets a count of 0.
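A minimal sketch of this behaviour, using the descriptions from the question. The third test description, with the made-up word "zzzunseen", is a hypothetical addition, included only to show that words missing from the training vocabulary are ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_descriptions = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON  switch",
    "Descente pour Cable tray fitting and accessories",
]
test_descriptions = [
    "TRIPLE Space heaters",
    "pushbutton switch",
    "zzzunseen switch",  # hypothetical row: "zzzunseen" never occurs in training
]

countvec = CountVectorizer()
# fit_transform learns the vocabulary from the training descriptions.
X_train = countvec.fit_transform(train_descriptions)
# transform reuses that vocabulary, so the columns line up with X_train.
X_test = countvec.transform(test_descriptions)

print(X_train.shape, X_test.shape)  # same number of columns in both
print(X_test[2].sum())              # only "switch" is counted in the last row
```

Both matrices stay sparse, so the 1.1 million rows in the question should not need to be densified for this to work.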

fit_transform() discards the previously learned vocabulary and builds a new matrix whose features can differ from those of the training matrix. Hence the error.
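The mismatch can be seen by comparing column counts. The sketch below also assumes that the feature selector from the question follows the same fit-on-train, transform-on-test pattern. That step is not part of the original answer, and the integer labels are stand-ins for the three categories:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

train_descriptions = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON  switch",
    "Descente pour Cable tray fitting and accessories",
]
# Integer stand-ins for the three categories in the question.
y_train = [0, 1, 2]
test_descriptions = ["TRIPLE Space heaters", "pushbutton switch"]

countvec = CountVectorizer()
X_train = countvec.fit_transform(train_descriptions)

# Fitting a vectorizer on the test set learns a different, smaller vocabulary.
refit_width = CountVectorizer().fit_transform(test_descriptions).shape[1]
print(X_train.shape[1], refit_width)  # the column counts differ

# Assumption: the selector is fitted on the training matrix only and then
# applied to the test matrix with transform(), mirroring the vectorizer.
fs = SelectPercentile(chi2, percentile=40)
X_train_sel = fs.fit_transform(X_train, y_train)
X_test_sel = fs.transform(countvec.transform(test_descriptions))
print(X_train_sel.shape, X_test_sel.shape)  # same number of columns
```

With both steps applied this way, the matrix passed to model.predict() should have the same columns as the one passed to model.fit().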
