I have a data set with two classes and was trying to get an optimal classifier using Weka. The best classifier I could obtain had about 79% accuracy. Then I tried adding attributes to my data by classifying it and saving the probability distribution generated by this classification in the data itself. When I reran the training process on the modified data, I got over 93% accuracy!! I'm sure this is wrong, but I can't figure out exactly why. These are the exact steps I went through (a rough Java-API equivalent follows the list):

  1. Open the data in Weka.
  2. Click Choose under Filter and select AddClassification from supervised -> attribute.
  3. Select a classifier. I selected J48 with default settings.
  4. Set "Output Classification" to false and "Output Distribution" to true.
  5. Run the filter and restore the class to your original nominal class. Note the additional attributes added to the end of the attribute list. They will have the names distribution_yourFirstClassName and distribution_yourSecondClassName.
  6. Go to the Classify tab and select a classifier: again, I selected J48.
  7. Run it. In this step I noticed much higher accuracy than before.
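
For reference, the same steps via the Weka Java API would look roughly like this (a sketch; "mydata.arff" is a placeholder and my class is the last attribute):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AddClassification;

public class AddClassificationSteps {
    public static void main(String[] args) throws Exception {
        // Step 1: load the data (placeholder file name).
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Steps 2-4: AddClassification with J48, distribution output only.
        AddClassification filter = new AddClassification();
        filter.setClassifier(new J48());
        filter.setOutputClassification(false);
        filter.setOutputDistribution(true);
        filter.setInputFormat(data);

        // Step 5: the filter trains its J48 on the FULL data set and
        // appends the distribution_* attributes to every instance.
        Instances augmented = Filter.useFilter(data, filter);
        augmented.setClassIndex(data.classIndex()); // keep the original class

        // Steps 6-7: a second J48 trained/evaluated on "augmented" now
        // sees attributes that were derived from the class labels.
    }
}
```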

Is this a valid way of creating classifiers? Didn't I "cheat" by adding classification information to the original data? If it is valid, how would one proceed to create a classifier that can predict unlabeled data? How would it add the additional attributes (the distribution)?

I did try reproducing the same effect using a FilteredClassifier, but it didn't work.
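
In case it helps, that attempt was along these lines (a rough sketch; the file name and seed are placeholders):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.AddClassification;

public class FilteredAttempt {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        AddClassification addCls = new AddClassification();
        addCls.setClassifier(new J48());
        addCls.setOutputDistribution(true);

        // FilteredClassifier rebuilds the filter (and its internal J48)
        // on the training data of each fold only, so the test instances
        // never influence the added distribution attributes.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(addCls);
        fc.setClassifier(new J48());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

Thanks.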

2 Answers

The process that you have undertaken appears somewhat close to the Stacking ensemble method, where the outputs of several classifiers are used to generate an ensemble output (more on that here).
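
If you want this idea done properly, Weka's own Stacking meta-classifier builds the meta-level data using internal cross-validation rather than reusing the training labels directly. A minimal sketch (the base learner, meta learner, and file name are just illustrative choices):

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacker = new Stacking();
        stacker.setClassifiers(new Classifier[] { new J48() });
        stacker.setMetaClassifier(new Logistic());
        stacker.setNumFolds(10); // meta-level data built by internal CV

        // Cross-validate the whole stacked model for an honest estimate.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacker, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```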

In your case, however, the attributes and the output of a previously trained classifier are being used to predict your class. It is likely that most of the second J48 model's rules will be based on the first model's output (as the class will correlate more strongly with the J48 distribution attributes than with the other attributes), but with some fine-tuning to improve accuracy. In this sense, the idea that 'two heads are better than one' is being used to improve the overall performance of the model.

That's not to say it is all good, though. If you needed to use your J48 on unseen data, you would not be able to regenerate the distribution attributes without the same J48 that originally produced them (unless you saved that model previously). Additionally, you add processing work by running more than one classifier instead of a single J48. These costs would also need to be weighed against the problem you are tackling.
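
Regarding saving the model: Weka's SerializationHelper lets you persist the first J48 so that exactly the same model regenerates the distribution attributes for unseen data later. A sketch (file names are placeholders):

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class SaveAndReuse {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // placeholder
        train.setClassIndex(train.numAttributes() - 1);

        // Train the first-stage J48 once and persist it.
        J48 firstStage = new J48();
        firstStage.buildClassifier(train);
        SerializationHelper.write("firstStage.model", firstStage);

        // Later, reload the identical model to regenerate the
        // distribution attributes for unseen instances.
        Classifier loaded = (Classifier) SerializationHelper.read("firstStage.model");
        Instances unseen = DataSource.read("unseen.arff"); // placeholder
        unseen.setClassIndex(unseen.numAttributes() - 1);
        double[] dist = loaded.distributionForInstance(unseen.instance(0));
        System.out.println(java.util.Arrays.toString(dist));
    }
}
```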

Hope this helps!

Okay, here is how I did cascaded learning:

  1. I took the dataset D and divided it into 10 equal-sized stratified folds (D1 to D10) without repetition.
  2. I applied algorithm A1 to train a classifier C1 on D1 to D9 and then, just like you, applied C1 to D10 to get the additional distribution of the positive and negative classes. I call this D10, with the additional two (or more, depending on what information from C1 you want included) attributes/features, D10_new.
  3. Next, I applied the same algorithm to train a classifier C2 on D1 to D8 plus D10 and then, just like you, applied C2 to D9 to get the additional distribution of the positive and negative classes. I call this D9 with the additional attributes/features D9_new.
  4. In this way I created D1_new to D10_new.
  5. Then I applied another classifier (preferably with a different algorithm A2) to D1_new to D10_new to predict the labels (a 10-fold CV is a good choice).

In this setup, you remove the bias of the first-stage classifier having seen the data it is later tested on. Also, it is advisable that A1 and A2 be different algorithms.
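
For completeness, here is a sketch of the above in the Weka Java API. It is only illustrative: A1 = J48 and A2 = Logistic are arbitrary choices, the file name is a placeholder, and the fold-block arithmetic mirrors what Instances.testCV uses internally after randomize and stratify.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CascadedLearning {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1: shuffle and stratify into 10 folds.
        int folds = 10;
        data.randomize(new Random(1));
        data.stratify(folds);

        // Copy the data and append one numeric attribute per class value
        // to hold the out-of-fold distributions (steps 2-4).
        int origAtts = data.numAttributes();
        Instances augmented = new Instances(data);
        for (int c = 0; c < data.numClasses(); c++) {
            augmented.insertAttributeAt(
                new Attribute("distribution_" + data.classAttribute().value(c)),
                augmented.numAttributes());
        }
        augmented.setClassIndex(data.classIndex());

        int n = data.numInstances();
        for (int f = 0; f < folds; f++) {
            // Train C_f (algorithm A1) on the other nine folds.
            J48 c1 = new J48();
            c1.buildClassifier(data.trainCV(folds, f));

            // Fold f occupies a contiguous block after stratify();
            // this is the same arithmetic Instances.testCV uses.
            int size = n / folds, offset;
            if (f < n % folds) { size++; offset = f; } else { offset = n % folds; }
            int first = f * (n / folds) + offset;

            // Fill in the distribution attributes for the held-out fold.
            for (int i = first; i < first + size; i++) {
                double[] dist = c1.distributionForInstance(data.instance(i));
                for (int c = 0; c < dist.length; c++) {
                    augmented.instance(i).setValue(origAtts + c, dist[c]);
                }
            }
        }

        // Step 5: run a second learner (A2, here Logistic) on the
        // augmented data with a fresh 10-fold cross-validation.
        Evaluation eval = new Evaluation(augmented);
        eval.crossValidateModel(new Logistic(), augmented, folds, new Random(2));
        System.out.println(eval.toSummaryString());
    }
}
```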