I am trying to use Weka to classify messages as spam or non-spam.
With hundreds of thousands of labeled spam messages and hundreds of thousands of labeled non-spam messages as a training set, I use StringToWordVector as the filter to train the classifier. The result of crossValidateModel is quite good. But I want to use a standalone test set to evaluate the classifier, to make sure it reliably classifies messages from outside the training set.
My question:
I have to run StringToWordVector over the test data set too, which creates a standalone .arff file that is independent of the training .arff file. A word that appears in both data sets then gets a different attribute index in each of the two .arff files. For example, the word "money" has attribute index 10 in the training .arff file, but in the testing .arff file it is the 50th attribute.
I am worried that the already trained classifier will mismatch these words across the two data sets, because they have different attribute indices. To be more specific, the sparse vector {1 1,2 1,3 5} in the training .arff represents "i want to to to to to....", but in the testing .arff file the same vector represents "money does not not not not .....". So how can this validation be reliable?
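To make the concern concrete, here is a minimal plain-Java sketch of what I think is going on. It is not Weka code, and I do not know whether StringToWordVector works this way internally; the class VocabIndexDemo and its methods buildIndex and encode are hypothetical names made up for illustration. It builds a word dictionary separately for a toy training set and a toy test set, shows that "money" gets a different index in each, and then encodes the test message against the training dictionary, which is what I assume I actually need.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT Weka's StringToWordVector.
class VocabIndexDemo {

    // Assign each distinct word an index in order of first appearance.
    static Map<String, Integer> buildIndex(List<String> messages) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String msg : messages) {
            for (String word : msg.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    index.putIfAbsent(word, index.size());
                }
            }
        }
        return index;
    }

    // Count words of one message against a given dictionary.
    // Words that are not in the dictionary are silently dropped.
    static int[] encode(String message, Map<String, Integer> index) {
        int[] counts = new int[index.size()];
        for (String word : message.toLowerCase().split("\\s+")) {
            Integer i = index.get(word);
            if (i != null) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy data, invented for this example.
        List<String> train = List.of("i want to to to", "send money now");
        List<String> test = List.of("money does not not not");

        Map<String, Integer> trainIndex = buildIndex(train);
        Map<String, Integer> testIndex = buildIndex(test);

        // Dictionaries built independently disagree on the index of "money".
        System.out.println("train index of money: " + trainIndex.get("money"));
        System.out.println("test index of money:  " + testIndex.get("money"));

        // Encoding the test message with the TRAINING dictionary keeps
        // the attribute positions the classifier was trained on.
        System.out.println(Arrays.toString(encode(test.get(0), trainIndex)));
    }
}
```

In this sketch the vector produced with testIndex would be meaningless to a model trained on trainIndex, which is exactly the mismatch I am afraid of with two independently generated .arff files.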
With crossValidateModel, the instances all come from the same .arff file, so Weka must match the indices to words correctly. My aim is to train it on a huge labeled data set, then use it to classify any single unlabeled message fed to it. Each time I want to classify a single message, I have to convert it to an .arff file, which has an entirely different attribute list and attribute indices from the training .arff file. (I am not using the windows tool; I am using the Weka .jar API in my program.)
Any help?