I'm using Weka to pre-process a dataset. The problem is that I have a nominal attribute, 'medical speciality', with more than 70 labels, so exploding it (converting it from nominal to binary) adds more than 70 attributes to the data set. I have an idea for reducing this number as much as possible.

Here is an example of the label values to illustrate the idea:

*Pediatrics

*Pediatrics-Endocrinology

*Endocrinology

So I need to keep only Pediatrics and Endocrinology as attributes, and instances labelled Pediatrics-Endocrinology should get a 1 in both Pediatrics and Endocrinology.

How can I do that with Weka? Any suggestions?

1 Answer

You should be able to do this using the StringToWordVector unsupervised attribute filter. You'll need to:

  • make sure the attribute(s) are string, not nominal. If the data is already loaded from ARFF, apply the NominalToString filter first. If you're loading the source data from a CSV file instead, tick the "Invoke options dialog" checkbox when opening the file and enter the attribute index(es) in the stringAttributes field
  • add a hyphen (-) to the list of delimiters in the tokenizer options, so that labels joined by a hyphen, as in your question, are split into separate words
  • set any other options you need in the StringToWordVector settings. For example, you might want to set lowerCaseTokens to true if the same word appears with different cases. You don't need to set anything for dictionaryFileToSaveTo unless you want to save a dictionary of the terms the filter found.

That's assuming you're using the Weka Explorer GUI. If you're operating Weka via the command line or API, refer to the documentation to see how to invoke the filter and set the appropriate options.
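
If it helps to see the result you should be aiming for, here is a minimal plain-Java sketch (no Weka involved) of the encoding those settings should give for the three labels in your question. It splits each label on hyphens and whitespace, lower-cases the pieces, and marks each resulting word as present (1) or absent (0). The class and method names (SpecialtyEncoder, tokenize, vocabulary, encodeRow) are made up for illustration, and this is not how Weka implements the filter internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, for illustration only: mimics the effect of
// hyphen-delimited tokenizing plus lowerCaseTokens = true.
class SpecialtyEncoder {

    // Split a label such as "Pediatrics-Endocrinology" on hyphens and
    // whitespace, lower-casing each token.
    static List<String> tokenize(String label) {
        List<String> tokens = new ArrayList<>();
        for (String t : label.toLowerCase().split("[-\\s]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Collect the distinct tokens over all labels, in sorted order.
    // Each token becomes one binary attribute (column).
    static List<String> vocabulary(List<String> labels) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String label : labels) {
            vocab.addAll(tokenize(label));
        }
        return new ArrayList<>(vocab);
    }

    // Encode one label as a 0/1 row over the given vocabulary.
    static int[] encodeRow(String label, List<String> vocab) {
        int[] row = new int[vocab.size()];
        for (String t : tokenize(label)) {
            int col = vocab.indexOf(t);
            if (col >= 0) {
                row[col] = 1;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> labels =
            List.of("Pediatrics", "Pediatrics-Endocrinology", "Endocrinology");
        List<String> vocab = vocabulary(labels);
        System.out.println(vocab);
        for (String label : labels) {
            System.out.println(label + " -> " + Arrays.toString(encodeRow(label, vocab)));
        }
    }
}
```

With the three example labels this gives two columns (endocrinology, pediatrics), and the Pediatrics-Endocrinology instance gets a 1 in both, which is the layout the question asks for.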