
I am trying to use Weka for text classification. For this purpose, it makes sense to use the sparse ARFF data file format. Using Weka 3.7.2, I tried:

  1. Loading a text directory into an Instances object using TextDirectoryLoader.
  2. Converting the strings produced by the first stage to numbers using StringToWordVector.

The first stage worked fine. The second stage ran into a problem that Weka's ARFF file format specification describes as follows:

Warning: There is a known problem saving SparseInstance objects from datasets that have string attributes. In Weka, string and nominal data values are stored as numbers; these numbers act as indexes into an array of possible attribute values (this is very efficient). However, the first string value is assigned index 0: this means that, internally, this value is stored as a 0. When a SparseInstance is written, string instances with internal value 0 are not output, so their string value is lost (and when the arff file is read again, the default value 0 is the index of a different string value, so the attribute value appears to change).

The ARFF specification suggests this workaround:

To get around this problem, add a dummy string value at index 0 that is never used whenever you declare string attributes that are likely to be used in SparseInstance objects and saved as Sparse ARFF files.

I am trying to do just that: add a dummy string. I have failed to do this manually (by editing the ARFF file). Could anyone who has already done this post an example: a program segment that does it, a properly modified ARFF file, or some other approach?

Thanks.

possible duplicate of "Weka printing sparse arff file" - Yuval F

1 Answer


Do NOT edit the arff file directly.

I just answered a similar question here: Weka printing sparse arff file

Use the same code example.
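
To make it clearer why the dummy value recommended by the ARFF specification helps, here is a small Weka-free toy model of the round trip. It is not the code from the linked answer and it does not call Weka at all: the class `SparseStringDemo`, the nested `StringAttribute`, the `roundTrip` method, and the `__dummy__` placeholder are all hypothetical names made up for illustration. The sketch assumes only what the warning quoted in the question states: string values are stored as indexes, the first value gets index 0, and a sparse writer omits entries whose internal value is 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparseStringDemo {

    /** Toy stand-in for a string attribute: each distinct string is stored once,
     *  and instances refer to it by its index. Not Weka's implementation. */
    static final class StringAttribute {
        private final List<String> values = new ArrayList<>();

        /** Returns the index of the value, adding it if it is new. */
        int addStringValue(String v) {
            int idx = values.indexOf(v);
            if (idx >= 0) {
                return idx;
            }
            values.add(v);
            return values.size() - 1;
        }

        String value(int index) {
            return values.get(index);
        }
    }

    /** Writes the documents in a sparse form, reads them back, and returns
     *  the strings the reader ends up with. */
    static List<String> roundTrip(List<String> docs, boolean useDummy) {
        // Writing side: optionally reserve index 0 for a value that is never used.
        StringAttribute writerAttr = new StringAttribute();
        if (useDummy) {
            writerAttr.addStringValue("__dummy__");
        }
        List<String> sparseCells = new ArrayList<>(); // null means "entry omitted"
        for (String doc : docs) {
            int idx = writerAttr.addStringValue(doc);
            sparseCells.add(idx == 0 ? null : doc); // internal value 0 is not written
        }

        // Reading side: an omitted entry defaults to index 0.
        StringAttribute readerAttr = new StringAttribute();
        if (useDummy) {
            readerAttr.addStringValue("__dummy__");
        }
        int[] indices = new int[sparseCells.size()];
        for (int i = 0; i < indices.length; i++) {
            String cell = sparseCells.get(i);
            indices[i] = (cell == null) ? 0 : readerAttr.addStringValue(cell);
        }

        // Resolve indexes back to strings, as a consumer of the data would.
        List<String> result = new ArrayList<>();
        for (int idx : indices) {
            result.add(readerAttr.value(idx));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two distinct values, so the reader always has something at index 0.
        List<String> docs = Arrays.asList("alpha", "beta", "alpha");
        System.out.println(roundTrip(docs, false)); // prints [beta, beta, beta]
        System.out.println(roundTrip(docs, true));  // prints [alpha, beta, alpha]
    }
}
```

Without the dummy, "alpha" sits at index 0, is never written, and comes back as whatever string the reader happens to place at index 0. With the dummy occupying index 0, every real string has a non-zero index and survives. In real Weka code the same idea would mean adding the dummy value to each string attribute before any real value is added, which is why hand-editing the saved ARFF file afterwards is not the right place to fix it.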