I have a data set with two classes and was trying to build an optimal classifier in Weka. The best classifier I could obtain had about 79% accuracy. Then I tried adding attributes to the data: I classified it and saved the probability distribution produced by that classification in the data itself. When I reran the training process on the modified data, I got over 93% accuracy! I'm sure this is wrong, but I can't figure out exactly why. These are the exact steps I went through:
- Open the data in Weka.
- Click "Choose" under Filter and select the `AddClassification` filter from `supervised -> attribute`.
- Select a classifier. I selected `J48` with default settings.
- Set "Output Classification" to false and "Output Distribution" to true.
- Run the filter and restore the class to your original nominal class. Note the additional attributes added to the end of the attribute list; they will have the names `distribution_yourFirstClassName` and `distribution_yourSecondClassName`.
- Go to the Classify tab and select a classifier: again I selected `J48`.
- Run it. In this step I noticed much higher accuracy than before.
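The accuracy jump above can be reproduced outside Weka with a toy sketch. Everything below is hypothetical, not Weka's actual code: the features and labels are pure random noise, a 1-nearest-neighbour classifier stands in for J48, and the `AddClassification` step is simulated by training the helper model on the *full* data set. Because that helper has seen every label, the appended "distribution" attribute effectively encodes the class itself, so cross-validated accuracy on the augmented data becomes near-perfect even though the real features carry no signal:

```python
import random

def nn_predict(train_X, train_y, x):
    """1-nearest-neighbour prediction; a stand-in for J48's (over-fit)
    resubstitution predictions."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cv_accuracy(X, y, k=5):
    """Plain k-fold cross-validated accuracy of the 1-NN classifier."""
    n, correct = len(X), 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        tX, ty = [X[i] for i in train], [y[i] for i in train]
        for i in range(n):
            if i % k == fold and nn_predict(tX, ty, X[i]) == y[i]:
                correct += 1
    return correct / n

random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(100)]  # pure noise
y = [random.randint(0, 1) for _ in range(100)]                 # random labels

baseline = cv_accuracy(X, y)  # near chance: the features carry no signal

# The AddClassification step as performed in the Explorer: the helper model
# is trained on the WHOLE data set, labels included, so its prediction for
# each instance is effectively that instance's own label.
leaked = [float(nn_predict(X, y, X[i])) for i in range(len(X))]
X_aug = [X[i] + [leaked[i]] for i in range(len(X))]

inflated = cv_accuracy(X_aug, y)  # near-perfect despite noise features
print(baseline, inflated)
```

Even cross-validation cannot catch the leak here, because the appended attribute was computed while the helper model could see the labels of every fold.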
Is this a valid way of creating classifiers? Didn't I "cheat" by adding classification information to the original data? If it is valid, how would one proceed to create a classifier that can predict unlabeled data? How would it add the additional distribution attributes?
I did try to reproduce the same effect using a FilteredClassifier, but it didn't work.
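A plausible reason the FilteredClassifier attempt does not reproduce the jump, sketched under the same toy assumptions as above (synthetic noise data, a 1-NN stand-in for J48): the wrapper re-fits its filter on each training fold only, so a test instance's extra attribute is computed without ever seeing that instance's label, and the leak disappears:

```python
import random

def nn_predict(train_X, train_y, x):
    """1-nearest-neighbour prediction; a stand-in for J48."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(100)]  # pure noise
y = [random.randint(0, 1) for _ in range(100)]                 # random labels

def filtered_cv_accuracy(X, y, k=5):
    """Mimics FilteredClassifier: the AddClassification step is re-fit on
    each training fold, so the test fold's labels are never seen."""
    n, correct = len(X), 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        tX, ty = [X[i] for i in train], [y[i] for i in train]
        # augment the training fold with its own resubstitution predictions
        atX = [tX[j] + [float(nn_predict(tX, ty, tX[j]))]
               for j in range(len(tX))]
        for i in range(n):
            if i % k != fold:
                continue
            # the test instance's extra attribute comes from the
            # train-fold model only -- no access to y[i]
            ax = X[i] + [float(nn_predict(tX, ty, X[i]))]
            if nn_predict(atX, ty, ax) == y[i]:
                correct += 1
    return correct / n

honest = filtered_cv_accuracy(X, y)
print(honest)  # stays near chance: no leak, no inflated accuracy
```

So the FilteredClassifier "not working" is actually the honest behaviour: it gives the accuracy you could expect on genuinely unlabeled data.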
Thanks.