0
votes

I am trying to use Weka to classify spam messages and non-spam messages.

With hundreds of thousands of labeled spam messages, and another hundreds of thousands of labeled non-spam messages, as a training data set, I use StringToWordVector as a filter to train the classifier. The result of crossValidateModel is quite good. But I want to use a standalone test set to evaluate the classifier, to make sure it is reliable for classifying messages outside the training set.

My question:

I have to use StringToWordVector over the test data set too, to create a standalone .arff file which is independent of the training .arff file. The same word that appears in both data sets gets two different attribute indexes in these two .arff files. For example, the word "money" has matrix index 10 in the training .arff file, but within the testing .arff file it is indexed as the 50th attribute.

I am worried that the already-trained classifier will mismatch all these words across the two data sets, as they have different matrix indices. To be more specific, the vector {1 1,2 1,3 5} in the training .arff represents "i want to to to to to....", but in the testing .arff file this same vector represents "money does not not not not .....". So how can this validation be reliable?

With crossValidateModel, the instances come from the same .arff file, so Weka must match the indices with words correctly. My aim is to train on a huge labeled data set, then use the classifier on any single unlabeled message fed to it. Each time I want to classify one message, I have to convert it to an .arff file, which has an entirely different attribute list and matrix indices from the training .arff file. (I am not using the Windows tool; I am using the Weka .jar API in my program.) Any help?

1
Can you elaborate on why the vector represents the string that way in the two examples you provided? Why have you repeated the words "to" and "not" so many times? - London guy
Consider also that cross validation is generally more robust at evaluating the quality of a classifier than manually split train-test sets. This is because it does train and test on separate subsets of your whole dataset several times, and then averages the results, ensuring that you do well in general rather than by chance. See <en.wikipedia.org/wiki/…> - kaz
to Abhishek Shivkumar: sorry for the unclear examples. I just want to express that the train set and test set represent strings in different ways. E.g., {1 1,2 1,3 5} -> "i want to to to to to", where 1->i, 2->want, 3->to, so "to" repeats 5 times. But maybe in the test set {1 1,2 1,3 5} -> "money does not not not not not", where 1->money, 2->does, 3->not, and here "not" repeats 5 times. Same vector, but different string; how can Weka validate the classification over this test set? I hope I am clear, thanks - basketballnewbie
to kaz: thanks, I know cross validation will do well. But my aim is to classify new instances in the future (e.g. user generated content, rather than the content we already have), so I want to use a trained classifier to judge messages in my production system - basketballnewbie

1 Answer

0
votes

You need to create a feature map file from your train set to achieve what you want. A feature map file is usually in the following format:

someword:1
someotherword:2
yetanotherword:3
...

This effectively maps every word to some index. So what you'd do is iterate over all of the files in your train set and map every word which exists in your train set to a unique id which will represent the word's index in your ARFFs.
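A minimal sketch of building such a map, assuming whitespace tokenization over already-loaded document strings (the class and method names here are illustrative, not part of any Weka API; a real pipeline would tokenize exactly as your StringToWordVector configuration does):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FeatureMapBuilder {
    // Build a word -> index map from training documents.
    // Indices start at 1 to match the feature-map file format above.
    public static Map<String, Integer> build(Iterable<String> trainDocs) {
        Map<String, Integer> featureMap = new LinkedHashMap<>();
        for (String doc : trainDocs) {
            // Naive lowercase whitespace tokenizer for illustration only.
            for (String word : doc.toLowerCase().split("\\s+")) {
                // Only assign a new index the first time a word is seen.
                featureMap.putIfAbsent(word, featureMap.size() + 1);
            }
        }
        return featureMap;
    }
}
```

Running this over a train set containing "i want to to to to to make money" yields exactly the mapping shown below (i:1, want:2, to:3, make:4, money:5).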

So let's say your train set contains one file with the words "i want to to to to to make money"; then your feature map would look like this:

i:1
want:2
to:3
make:4
money:5

And your attributes in your ARFFs would look like this:

@ATTRIBUTE i NUMERIC
@ATTRIBUTE want NUMERIC
@ATTRIBUTE to NUMERIC
@ATTRIBUTE make NUMERIC
@ATTRIBUTE money NUMERIC

Where each attribute represents the number of times the word showed up in an email.

Then, if you want to make an ARFF for a test set you'd iterate over all of the files in your test set and for every word you come across, look it up in the feature map. If the word is in your feature map you know to increment the value of the attribute at the index to which that word is mapped. If the word is not in your feature map then you ignore it because your classifier wasn't trained on the word and doesn't even know the word exists.
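That lookup-and-count step can be sketched like this (again a hypothetical helper, assuming the same whitespace tokenization as above; unknown words are silently dropped, as described):

```java
import java.util.HashMap;
import java.util.Map;

public class TestVectorizer {
    // Count occurrences of known words in a test document.
    // Words absent from the feature map are ignored, because the
    // classifier was never trained on them.
    public static Map<Integer, Integer> vectorize(String doc,
                                                  Map<String, Integer> featureMap) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String word : doc.toLowerCase().split("\\s+")) {
            Integer index = featureMap.get(word);
            if (index == null) continue; // not in train vocabulary -> skip
            counts.merge(index, 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting sparse attribute-index -> count map is what you would write out as a sparse instance in the test ARFF, using the same indices as the training ARFF.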

This will keep the attributes of your train set and any test sets you have perfectly aligned.

I would recommend reading your feature map file into a Java HashMap<String, Integer>, mapping from word (String) to attribute index (Integer), for quick lookups when computing the attribute values of test-set emails.
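A small sketch of that parsing step, assuming the "word:index" file format shown at the top of this answer (the class name is made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

public class FeatureMapReader {
    // Parse lines of the form "word:index" into a word -> index map.
    public static Map<String, Integer> read(Reader in) throws IOException {
        Map<String, Integer> featureMap = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on the last ':' so words containing ':' still parse.
                int sep = line.lastIndexOf(':');
                if (sep < 0) continue; // skip malformed lines
                featureMap.put(line.substring(0, sep),
                               Integer.parseInt(line.substring(sep + 1).trim()));
            }
        }
        return featureMap;
    }
}
```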