
I am using 10-fold cross-validation to train on 200K records. The target class attribute is:

Status {PASS,FAIL}

Pass has ~144K and Fail has ~6K instances.

When I train the model using J48, it is not able to find the failures. The accuracy is 95%, but in most cases it just predicts success, whereas in our case we need to find the failures that are actually happening.

So my questions are mainly conceptual:

  1. Does the distribution of class instances (PASS vs. FAIL in my case) really matter during training?

  2. What parameter values could I use for Weka's J48 tree to train it better? I see about 2% failures in every 1,000 records I pass in, so adding more success scenarios will only increase the number of predicted successes.

  3. What ratio between the classes should I use to train the model better?

I could not find anything in the API as far as the ratio is concerned.

I am not adding the code because this happens both with the Java API and with the Weka GUI tool.

Many Thanks.

1 Answer


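The problem here is that your dataset is very unbalanced: with roughly 144K PASS and 6K FAIL instances, a classifier that always predicts PASS already scores about 96% accuracy without finding a single failure. You have a few options to help your classification task: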

  1. Generate synthetic instances for your minority class using an algorithm like SMOTE. This often improves detection of the minority class.
  2. It's not possible in every case, but you could try splitting your majority class into a couple of smaller classes, which would improve the balance.
  3. I believe Weka has a one-class classifier. It learns the decision boundary of the larger class and treats the minority class as outliers, which will hopefully give better classifications. See here for Weka's implementation.
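
To make the first option more concrete, here is a minimal sketch of the interpolation idea behind SMOTE, written in plain Java with no Weka dependency. It is a simplified illustration and not Weka's SMOTE filter: the class name `SmoteSketch`, its method names, and the toy feature vectors are all made up for this example, and it assumes purely numeric attributes. Each synthetic instance is placed at a random point on the line between a minority instance and one of its k nearest minority neighbours.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Weka.
final class SmoteSketch {

    /** Squared Euclidean distance between two feature vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Indices of the (at most) k nearest neighbours of minority[index], excluding itself. */
    static int[] nearestNeighbours(double[][] minority, int index, int k) {
        List<Integer> others = new ArrayList<>();
        for (int i = 0; i < minority.length; i++) {
            if (i != index) {
                others.add(i);
            }
        }
        others.sort(Comparator.comparingDouble(
                i -> squaredDistance(minority[index], minority[i])));
        int n = Math.min(k, others.size());
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = others.get(i);
        }
        return result;
    }

    /** Creates `count` synthetic minority instances by interpolating between neighbours. */
    static double[][] oversample(double[][] minority, int count, int k, long seed) {
        if (minority.length < 2 || k < 1) {
            throw new IllegalArgumentException("need at least two minority instances and k >= 1");
        }
        Random rng = new Random(seed);
        double[][] synthetic = new double[count][];
        for (int s = 0; s < count; s++) {
            int base = rng.nextInt(minority.length);
            int[] neighbours = nearestNeighbours(minority, base, k);
            double[] neighbour = minority[neighbours[rng.nextInt(neighbours.length)]];
            double gap = rng.nextDouble(); // position along the segment, in [0, 1)
            double[] point = new double[minority[base].length];
            for (int f = 0; f < point.length; f++) {
                point[f] = minority[base][f] + gap * (neighbour[f] - minority[base][f]);
            }
            synthetic[s] = point;
        }
        return synthetic;
    }

    public static void main(String[] args) {
        // Toy FAIL instances with two numeric features (made-up values).
        double[][] fail = {{1.0, 2.0}, {1.5, 2.5}, {3.0, 1.0}, {2.0, 2.0}};
        for (double[] p : oversample(fail, 3, 2, 7L)) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

The synthetic points would be added to the training folds only, never to the test folds, so that the evaluation still reflects the real class distribution.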

Edit: You could also use a classifier that weights classifications based on whether they are correct or not. Again, Weka has this as a meta classifier that can be applied to most base classifiers; see here again.
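
One common way to weight errors differently is cost-sensitive classification, and the sketch below shows the decision rule behind it in plain Java. It is only an illustration under stated assumptions: the class `CostSensitiveDecision`, its methods, and the 20:1 cost ratio are hypothetical, and the probability of FAIL is assumed to come from some base classifier such as a decision tree. Weka ships a meta classifier named `CostSensitiveClassifier` that follows this general idea; whether that is the one linked in the answer is an assumption.

```java
// Hypothetical helper for illustration only; not part of Weka.
final class CostSensitiveDecision {

    enum Status { PASS, FAIL }

    /**
     * Picks the label with the lower expected misclassification cost.
     *
     * @param pFail          probability of FAIL estimated by a base classifier
     * @param costMissedFail cost of predicting PASS when the truth is FAIL
     * @param costFalseAlarm cost of predicting FAIL when the truth is PASS
     */
    static Status minimumCostLabel(double pFail, double costMissedFail, double costFalseAlarm) {
        // Predicting PASS is only wrong (and only costs something) if the instance is FAIL.
        double expectedCostOfPass = pFail * costMissedFail;
        // Predicting FAIL is only wrong if the instance is actually PASS.
        double expectedCostOfFail = (1.0 - pFail) * costFalseAlarm;
        return expectedCostOfFail < expectedCostOfPass ? Status.FAIL : Status.PASS;
    }

    /** Probability of FAIL above which FAIL becomes the cheaper prediction. */
    static double failThreshold(double costMissedFail, double costFalseAlarm) {
        return costFalseAlarm / (costFalseAlarm + costMissedFail);
    }

    public static void main(String[] args) {
        double pFail = 0.10; // made-up probability for a borderline instance

        // Equal costs behave like an ordinary classifier.
        System.out.println(minimumCostLabel(pFail, 1.0, 1.0));

        // Assumption: a missed failure is 20 times as costly as a false alarm.
        System.out.println(minimumCostLabel(pFail, 20.0, 1.0));
        System.out.println(failThreshold(20.0, 1.0));
    }
}
```

With equal costs the rule reduces to the usual 0.5 threshold, which is why a tree trained on mostly PASS data rarely predicts FAIL. Raising the cost of a missed failure lowers the threshold, trading more false alarms for more detected failures.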