How to classify text properly in Weka when preprocessing is needed

Question

I need to classify some text using Weka programmatically, but I am having trouble because the training data and the data to be classified must be filtered the same way before being used with the classifier.

My current approach to the problem is:
1. Create an ARFF file of training data with a string attribute and a class attribute.
2. Apply StringToWordVector to the data set and save the filter for future use.
3. Apply the AttributeSelection filter to the resulting data and save that filter for future use.
4. Train the classifier on that data and save the classifier.
5. Create an Instances object with the same attributes as the ARFF file and populate it with the instance I want to classify, leaving the value of the class attribute missing.
6. Load the StringToWordVector filter and use it to filter the Instances.
7. Load the AttributeSelection filter and use it to filter the result.
8. Load the classifier and classify the result.

It seems that StringToWordVector is working as I expected and using the same set of words with the new data as with the old. The problem is with AttributeSelection that tries, it seems, to run again not knowing that I just want it to use the attributes it already filtered before.

doxav · Accepted Answer · 2014-07-13T23:43:03

Reusing the same attribute selection setup: the attribute selection you apply is a filter, so you should use the batch filtering method to reuse it and get compatible data (http://weka.wikispaces.com/Use+Weka+in+your+Java+code#Batch%20filtering). After declaring and configuring your filter, call setInputFormat (i.e. myfilter.setInputFormat(train)), apply it to the training data (Filter.useFilter(train, myfilter)), and serialize the initialized filter if you want to use it later on test data. The setInputFormat(Instances) method always has to be the last call before the filter is applied.
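The batch filtering pattern described above can be sketched roughly as follows. This is a minimal sketch, assuming Weka 3.7 or later on the classpath; the class name BatchFilterSketch, the toy "text"/"label" schema, and the example documents are hypothetical, and the attribute selection filter is left on its default evaluator and search method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.Utils;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

final class BatchFilterSketch {

    /** Empty dataset with one string attribute and a nominal class (hypothetical toy schema). */
    static Instances emptyDataset() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("text", (List<String>) null)); // null value list => string attribute
        attrs.add(new Attribute("label", Arrays.asList("ham", "spam")));
        Instances data = new Instances("docs", attrs, 0);
        data.setClassIndex(1);
        return data;
    }

    /** Appends one document; a null label leaves the class value missing. */
    static void addDoc(Instances data, String text, String label) {
        double[] vals = new double[2];
        vals[0] = data.attribute(0).addStringValue(text);
        vals[1] = (label == null) ? Utils.missingValue() : data.attribute(1).indexOfValue(label);
        data.add(new DenseInstance(1.0, vals));
    }

    /** Initializes both filters on the training data, then pushes the test data through the same objects. */
    static Instances[] filterBoth(Instances train, Instances test) throws Exception {
        StringToWordVector swv = new StringToWordVector();
        swv.setInputFormat(train);                             // last call before the first use
        Instances trainVec = Filter.useFilter(train, swv);     // first batch: builds the dictionary
        Instances testVec = Filter.useFilter(test, swv);       // second batch: reuses that dictionary

        AttributeSelection sel = new AttributeSelection();     // default evaluator and search
        sel.setInputFormat(trainVec);
        Instances trainSel = Filter.useFilter(trainVec, sel);  // first batch: runs the selection
        Instances testSel = Filter.useFilter(testVec, sel);    // second batch: applies the stored selection
        return new Instances[] {trainSel, testSel};
    }

    public static void main(String[] args) throws Exception {
        Instances train = emptyDataset();
        addDoc(train, "cheap pills buy now", "spam");
        addDoc(train, "buy cheap watches now", "spam");
        addDoc(train, "meeting agenda for monday", "ham");
        addDoc(train, "lunch on monday with team", "ham");

        Instances test = emptyDataset();
        addDoc(test, "buy cheap pills", null);                 // class value left missing

        Instances[] out = filterBoth(train, test);
        System.out.println("Compatible headers: " + out[0].equalHeaders(out[1]));
    }
}
```

The point of the sketch is that the same filter objects see the training data first and the new data second, so the second pass only applies what the first pass learned.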
Not re-running the attribute selection: use the reduceDimensionality method of your weka.attributeSelection.AttributeSelection object (e.g. mySelection.reduceDimensionality(test) reduces the dimensionality to include only those attributes "chosen by the last run of attribute selection"). I think this is your main problem now.
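As a rough illustration of reduceDimensionality, the sketch below runs the selection once and then reduces both the training data and new data with the stored result. It assumes Weka 3.7 or later; note that this AttributeSelection lives in the weka.attributeSelection package. The class name ReduceDimensionalitySketch, the toy nominal dataset, and the choice of InfoGainAttributeEval with Ranker are illustrative assumptions, not something the question prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

final class ReduceDimensionalitySketch {

    /** Toy nominal dataset (hypothetical): three yes/no features plus a yes/no class. */
    static Instances toyData(String[][] rows) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (String name : new String[] {"f1", "f2", "f3", "label"}) {
            attrs.add(new Attribute(name, Arrays.asList("no", "yes")));
        }
        Instances data = new Instances("toy", attrs, rows.length);
        data.setClassIndex(3);
        for (String[] row : rows) {
            double[] vals = new double[4];
            for (int i = 0; i < 4; i++) {
                vals[i] = data.attribute(i).indexOfValue(row[i]);
            }
            data.add(new DenseInstance(1.0, vals));
        }
        return data;
    }

    /** Runs the selection once on the training data and returns the configured selector. */
    static AttributeSelection select(Instances train, int numToSelect) throws Exception {
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(numToSelect);       // keep only the top-ranked attributes
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(ranker);
        selector.SelectAttributes(train);         // the only place where the selection is computed
        return selector;
    }

    public static void main(String[] args) throws Exception {
        Instances train = toyData(new String[][] {
            {"yes", "no", "yes", "yes"},
            {"yes", "yes", "no", "yes"},
            {"no", "no", "yes", "no"},
            {"no", "yes", "no", "no"}});
        Instances unseen = toyData(new String[][] {{"yes", "yes", "yes", "no"}});

        AttributeSelection selector = select(train, 1);
        Instances reducedTrain = selector.reduceDimensionality(train);
        Instances reducedUnseen = selector.reduceDimensionality(unseen); // no second selection run
        System.out.println("Compatible headers: " + reducedTrain.equalHeaders(reducedUnseen));
    }
}
```

In the question's setting, the data passed to reduceDimensionality would be the output of the already initialized StringToWordVector, so that both calls see the same attribute layout.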
If you want to reuse multiple filters (e.g. StringToWordVector, standardization, selection), you should try a MultiFilter, which chains them into a single filter:

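    import weka.filters.Filter;
    import weka.filters.MultiFilter;
    import weka.filters.supervised.attribute.AttributeSelection;
    import weka.filters.unsupervised.attribute.Standardize;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    StringToWordVector swv = new StringToWordVector();
    AttributeSelection as = new AttributeSelection();
    Standardize st = new Standardize();
    MultiFilter mf = new MultiFilter();
    // Filters are applied in array order: word vectors, then standardization, then selection
    Filter[] filters = {swv, st, as};
    mf.setFilters(filters);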
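One possible way to keep such a filter chain and the classifier together is Weka's FilteredClassifier, which is not mentioned in the answer and is swapped in here as an assumption. The sketch below wraps a MultiFilter and a classifier into one object that can be serialized with SerializationHelper. It assumes Weka 3.7 or later; the class name FilteredTextModel, the NaiveBayes classifier, and the InfoGainAttributeEval/Ranker configuration are illustrative choices, and Standardize is left out for brevity.

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

final class FilteredTextModel {

    /**
     * Trains on raw data (string attribute plus nominal class, class index set).
     * numToSelect must not exceed the number of word attributes produced.
     */
    static FilteredClassifier train(Instances rawTrain, int numToSelect) throws Exception {
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(numToSelect);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval());
        selection.setSearch(ranker);

        MultiFilter chain = new MultiFilter();
        chain.setFilters(new Filter[] {new StringToWordVector(), selection});

        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(chain);                 // filters are initialized on the training data only
        model.setClassifier(new NaiveBayes());
        model.buildClassifier(rawTrain);
        return model;
    }

    /** Writes the trained filters and classifier to disk as a single object. */
    static void save(FilteredClassifier model, String path) throws Exception {
        SerializationHelper.write(path, model);
    }

    /** Reads a model previously written by save. */
    static FilteredClassifier load(String path) throws Exception {
        return (FilteredClassifier) SerializationHelper.read(path);
    }
}
```

With this arrangement, a raw instance whose class value is missing can be passed to classifyInstance or distributionForInstance on the loaded model, and the stored filters are applied to it without being rebuilt.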

Xavier
