I've run into a data pre-processing problem with python-weka-wrapper v0.3.10 that has confused me for a couple of days.

I'm using create_instances_from_matrices() to build my dataset (Instances) from an ndarray; all attributes are numeric.

I then save the dataset to an ARFF file (numeric_data.arff) via

Saver(classname="weka.core.converters.ArffSaver")

Then I tried to convert the dataset to nominal type with

Filter(classname="weka.filters.unsupervised.attribute.NumericToNominal", options=["-R", "first-last"])

This raises the following exception:

Exception in thread "Thread-0" java.lang.IllegalArgumentException: A nominal attribute (x2) cannot have duplicate labels (1).

However, the same dataset (numeric_data.arff) that I generated can be converted to nominal type successfully in the Weka GUI Explorer v3.8.1.

I'd appreciate any ideas that could help.

Thanks!

2 Answers

0
votes

The problem is most likely that you have small values (< 10^-6), all of which Weka turns into 0.0 when saving (by default, Weka only writes 6 digits after the decimal point). If you apply your filter to the dataset before you save it, it should work.

Alternatively, you can tell the ArffSaver how many decimals you would like to use when saving the file (-decimal option). See also the Javadoc of the ArffSaver class.
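To see why the number of decimals matters, here is a small pure-Python sketch of a write-then-read round trip. It only imitates fixed-decimal output with Python string formatting; `write_then_read` is a hypothetical helper for illustration, not part of Weka or python-weka-wrapper, and Weka's own formatting may differ in detail.

```python
def write_then_read(value, decimals=6):
    """Format a float with a fixed number of decimals, then parse it back.

    This imitates saving a value to a text file with limited precision
    and loading it again (an illustration only, not Weka's actual code).
    """
    text = f"{value:.{decimals}f}"
    return float(text)


tiny = 0.0000004        # smaller than 10^-6
near_two = 2.00000001   # differs from 2.0 only in the 8th decimal

print(write_then_read(tiny))            # 0.0
print(write_then_read(near_two))        # 2.0
print(write_then_read(near_two, 10))    # 2.00000001
```

With 6 decimals both values lose their distinguishing digits; with 10 decimals the second one survives the round trip, which is the effect a larger -decimal setting is meant to have.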

0
votes

(excerpt from discussion on the mailing list, to answer this question properly)

The NumericToNominal filter is a very dumb filter. Its sole purpose is to provide an easy way of converting numeric values that are supposed to be categorical ones (e.g., 1.0, 2.0 into labels 1 and 2). When turning numbers into labels, it cuts them off after 6 decimals. However, it doesn't check whether this could generate duplicate labels. E.g., 2.00000001 and 2.0 will both generate 2 (or 2.0) as the label.
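If you want to check a column for this before filtering, a rough pure-Python sketch like the following can help. The helpers `label_for` and `find_label_collisions` are hypothetical names made up for this example, and the label formatting (round to 6 decimals, strip trailing zeros) is an assumption that only approximates what the filter does.

```python
from collections import defaultdict


def label_for(value, decimals=6):
    """Approximate the label a numeric value would get (assumed formatting)."""
    # Fixed number of decimals, then drop trailing zeros and a dangling dot.
    text = f"{value:.{decimals}f}".rstrip("0").rstrip(".")
    # Tiny negative values would otherwise end up as "-0".
    return "0" if text == "-0" else text


def find_label_collisions(values, decimals=6):
    """Return labels that would be produced by more than one distinct value."""
    groups = defaultdict(set)
    for v in values:
        groups[label_for(v, decimals)].add(v)
    return {label: sorted(vs) for label, vs in groups.items() if len(vs) > 1}


column = [1.0, 1.0000000003, 2.0, 3.5]
print(find_label_collisions(column))  # {'1': [1.0, 1.0000000003]}
```

A non-empty result for an attribute suggests that it would trigger the "duplicate labels" exception, and that its values should be rounded first.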

You can use the NumericCleaner filter to remove these trailing decimals, e.g.:

from weka.filters import Filter

# round to specified number of decimals
# (data is the Instances object created earlier; the JVM must already be started)
decimals = 3
cleaner = Filter(classname="weka.filters.unsupervised.attribute.NumericCleaner", options=["-R", "first-last", "-decimals", str(decimals)])
cleaner.inputformat(data)
data = cleaner.filter(data)