1
votes

I'm trying to build an SMS spam classifier using the WEKA library. I have a CSV file with "label" and "text" columns. When I use the code below, it creates an ARFF file with two attributes:

@attribute label {ham,spam}
@attribute text {'Go until jurong point','Ok lar...', etc.}

Currently, it seems that the text attribute is formatted as a nominal attribute with each message's text as a value. But I need the text attribute to be a String attribute, not a list of all of the text from all instances. Having the text attribute as a String will allow me to use the StringToWordVector filter for training a classifier.

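import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

// load CSV (args[0] is the path of the input CSV file)
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();

// save ARFF (args[1] is the path of the output ARFF file)
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
// setDestination is probably redundant on recent WEKA versions, where setFile is enough
saver.setDestination(new File(args[1]));
saver.writeBatch();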

I know I can create a String attribute like this:

Attribute tmp = new Attribute("tmp", (FastVector) null);

But I don't know how to replace the current attribute, or set the attribute type before reading in the CSV.

I tried inserting a new String attribute and deleting the current nominal attribute, but this deletes all of the SMS text along with it. I also tried using renameAttributeValue, but this doesn't seem to work for changing the attribute type.

EDIT: I suspect that the NominalToString filter will do the job, but I'm not sure how to use it.

Any suggestions would be much appreciated. Thanks!

1 Answer

8
votes

This did the trick. It changed the text attribute type, but not the label attribute type (though I'm not sure why it did one but not the other).

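import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;

NominalToString filter1 = new NominalToString();
filter1.setInputFormat(data);            // data is the Instances object loaded from the CSV
data = Filter.useFilter(data, filter1);  // returns a filtered copy of the data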
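
One plausible reason the label attribute was left alone: as far as I can tell, NominalToString converts only the last attribute unless told otherwise, and text happens to be the last column. A minimal sketch that makes the choice explicit is below. The class name TextAttributeConverter and the helper nominalToString are made up for illustration, and it assumes a WEKA version where NominalToString exposes setAttributeIndexes (older releases may use a different setter).

```java
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;

// Hypothetical helper class, not part of WEKA.
public class TextAttributeConverter {

    // Converts the attributes named by `range` from nominal to string and
    // returns the filtered copy. `range` uses WEKA's 1-based range syntax,
    // e.g. "2", "1-3" or "last".
    public static Instances nominalToString(Instances data, String range) throws Exception {
        NominalToString filter = new NominalToString();
        // Options should be set before setInputFormat, which fixes the output format.
        filter.setAttributeIndexes(range);
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }
}
```

For the CSV in the question, something like data = TextAttributeConverter.nominalToString(data, "2"); should convert text and leave label nominal, regardless of column order defaults.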

There's also a small tip about this in the WEKA documentation:

By default, non-numerical attributes get imported as NOMINAL attributes, which is not necessarily desired for textual data, especially if one wants to use the StringToWordVector filter. In order to change the attribute to STRING, one can run the NominalToString filter (package weka.filters.unsupervised.attribute) on the data, specifying the attribute index or range of indices that should be converted (NB: this filter does not exclude the class attribute from conversion!).
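
Since the tip is about how attributes get imported, another option may be to ask the loader for a STRING attribute up front, so no filter is needed afterwards. This is only a sketch: it assumes a WEKA version whose CSVLoader has setStringAttributes (I believe 3.7 and later do), and the class name CsvTextLoader and the method load are invented for this example.

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

// Hypothetical helper class, not part of WEKA.
public class CsvTextLoader {

    // Loads a CSV file and forces the columns in `stringRange` (1-based WEKA
    // range syntax, e.g. "2" or "last") to be read as STRING attributes.
    public static Instances load(File csv, String stringRange) throws Exception {
        CSVLoader loader = new CSVLoader();
        // Assumption: setStringAttributes exists in the WEKA version in use.
        loader.setStringAttributes(stringRange);
        loader.setSource(csv);
        return loader.getDataSet();
    }
}
```

With the "label,text" CSV from the question, CsvTextLoader.load(new File(args[0]), "2") should give a nominal label and a string text attribute, ready for StringToWordVector.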