I am using the XGBoost Python package to perform text classification.
Below is the training set I am using:
itemid | description | category
11802974 | SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters | Architectural Diffusers
10688548 | ANTIQUE BRONZE FINISH PUSHBUTTON switch | Door Bell Pushbuttons
9836436 | Descente pour Cable tray fitting and accessories | Tray Cable Drop Outs
I construct a document-term matrix from the description column using scikit-learn's CountVectorizer, which produces a SciPy sparse matrix. (I have about 1.1 million rows, so I use the sparse representation to save memory.) The code is below:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
documenttermmatrix = countvec.fit_transform(trainset['description'])
After that, I apply feature selection to the above matrix using:
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=40)
documenttermmatrix_train = fs.fit_transform(documenttermmatrix, y_train)
I then train the model with the XGBoost classifier:
from xgboost import XGBClassifier
model = XGBClassifier(silent=False)
model.fit(documenttermmatrix_train, y_train, verbose=True)
Below is the test set I am using:
itemid | description | category
9836442 | TRIPLE Space heaters | Architectural Diffusers
13863918 | pushbutton switch | Door Bell Pushbuttons
I construct a separate matrix for the test set, the same way I did for the training set, using the code below:
documenttermmatrix_test=countvec.fit_transform(testset['description'])
When predicting on the test set, XGBoost expects every feature of the training set to be present in the test set, but that is not the case here: the test matrix has a different set of columns (and a sparse matrix stores only the non-zero entries).
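The mismatch can be reproduced on a small scale. The snippet below is only a sketch: the document lists are toy stand-ins taken from the sample rows above (the column split is my reading of them), and the variable names `train_docs`, `test_docs`, `train_matrix` and `test_matrix` are made up for the illustration. It shows that calling `fit_transform` a second time relearns the vocabulary from the test descriptions alone, so the two matrices end up with different numbers of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the real description column (hypothetical data).
train_docs = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON switch",
    "Descente pour Cable tray fitting and accessories",
]
test_docs = ["TRIPLE Space heaters", "pushbutton switch"]

countvec = CountVectorizer()

# Vocabulary is learned from the training descriptions.
train_matrix = countvec.fit_transform(train_docs)

# fit_transform again: the vocabulary is relearned from the test set only,
# so the columns no longer correspond to the training columns.
test_matrix = countvec.fit_transform(test_docs)

print("train columns:", train_matrix.shape[1])
print("test columns: ", test_matrix.shape[1])
```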
I cannot combine the training and test sets into a single dataset, because I need to run feature selection on the training set only.
Can anyone suggest how I should proceed?
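One direction that might work, sketched here on toy data and not verified on the full 1.1 million rows: keep the fitted `countvec` and `fs` objects from the training step and call `transform` (rather than `fit_transform`) on the test descriptions, so that the test matrix is built with the training vocabulary and the training feature mask. The data, labels and names `X_train` / `X_test` below are hypothetical, and XGBoost is left out so the snippet depends only on scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Toy stand-ins for the real data (hypothetical).
train_docs = [
    "SPRO VUH3C1 DIFFUSER VUH1 TRIPLE Space heaters",
    "ANTIQUE BRONZE FINISH PUSHBUTTON switch",
    "Descente pour Cable tray fitting and accessories",
]
train_labels = [
    "Architectural Diffusers",
    "Door Bell Pushbuttons",
    "Tray Cable Drop Outs",
]
test_docs = ["TRIPLE Space heaters", "pushbutton switch"]

countvec = CountVectorizer()
fs = SelectPercentile(chi2, percentile=40)

# Fit the vocabulary and the feature selector on the training set only.
X_train = fs.fit_transform(countvec.fit_transform(train_docs), train_labels)

# Reuse the fitted objects: transform only, no refitting on the test set.
X_test = fs.transform(countvec.transform(test_docs))

# Both matrices stay sparse and share the same column layout.
print(X_train.shape, X_test.shape)
```

Words that appear only in the test set are simply dropped by `countvec.transform`, and the feature selection is still learned from the training set alone, so the two sets never need to be merged. The resulting `X_train` and `X_test` would then presumably be what gets passed to `model.fit` and `model.predict`.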