1
votes

I am very new to machine learning. Sorry if there are any mistakes in my English.

I am using Weka's J48 classifier to predict a true/false label. I trained the model on a training set of almost 999K instances, using 3-fold cross-validation, which gives me an accuracy of ~84%.

After saving the model, I tested it on a 50K dataset. The results are very bad: about 50% of the predictions are mismatches. The data has 11 attributes, a mix of nominal and numeric fields.

I don't know why this is happening.

I have two questions.

  1. How can I train the model so that it performs better on the test set?
  2. What could the possible issues be?

I am using the Weka API in Java.

1
How did you choose the 50K set to test? - Tim Biegeleisen
Actually, I am using 30 days of data for training and 1 day of data for testing and predicting. - Pandit
How are you obtaining the 1 day of test data? - Tim Biegeleisen
I am getting it as a CSV file, which I then convert to ARFF. - Pandit

1 Answer

2
votes

This suggests that your model is overfit to your 999K training set and doesn't generalize well to your 50K test set.

You should look into cross-validating on a good portion (but not all) of your 50K dataset in addition to your 999K, keeping the rest of the 50K aside as a final test set.

You may also want to try k-fold cross-validation with a k higher than 3, because 3 folds may be too "coarse": each model is trained on only about two thirds of the data. Good luck!
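
To make the two suggestions above a bit more concrete, here is a minimal sketch in plain Java (deliberately not the Weka API) of how k-fold cross-validation partitions a pooled dataset, and how the train/test sizes change between k=3 and k=10. The names `FoldSketch` and `kFolds`, the seed, and the figure of 40,000 pooled test rows are all assumptions made for illustration; "almost 999K" is rounded to 999,000.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class FoldSketch {

    /**
     * Shuffles the row indices 0..n-1 and deals them into k folds of
     * near-equal size. Each fold serves once as the held-out test part,
     * while the other k-1 folds form the training part.
     */
    static List<List<Integer>> kFolds(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Shuffle so that no fold contains only one day's (or one source's) rows.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>(k);
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        // Assumption: pool the ~999K training rows with 40K of the 50K test
        // rows, leaving 10K untouched as a final hold-out set.
        int pooled = 999_000 + 40_000;
        for (int k : new int[] {3, 10}) {
            List<List<Integer>> folds = kFolds(pooled, k, 42L);
            int testSize = folds.get(0).size();
            System.out.println("k=" + k + ": each model trains on "
                    + (pooled - testSize) + " rows and is tested on " + testSize);
        }
    }
}
```

With a larger k, each model sees a larger share of the pooled data during training. In Weka itself you would not write this by hand: the number of folds is passed as an argument to `Evaluation.crossValidateModel`.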