How to do proper testing in Weka and how to get the desired results?

Question

I am currently applying ANN, SVM and linear regression methods to predict the fruit yield of a region from meteorological factors (13 factors). The total data set has 36 instances.

When I run these methods in WEKA I get bad results. For example, with MultilayerPerceptron (I split the dataset into 28 instances for training and 8 for testing) my results are:

=== Run information ===

Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -G -R
Relation: apr6_data
Instances: 28
Attributes: 15

Time taken to build model: 3.69 seconds

=== Predictions on test set ===

inst#   actual   predicted   error
1       2.551    2.36        -0.191
2       2.126    3.079        0.953
3       2.6      1.319       -1.281
4       1.901    3.539        1.638
5       2.146    3.635        1.489
6       2.533    2.917        0.384
7       2.54     2.744        0.204
8       2.82     3.473        0.653

=== Evaluation on test set ===
=== Summary ===

Correlation coefficient         -0.4415
Mean absolute error              0.8493
Root mean squared error          1.0065
Relative absolute error        144.2248 %
Root relative squared error    153.5097 %
Total Number of Instances        8
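The summary figures can be checked by hand from the prediction table. The sketch below is plain Python, not Weka code, and it assumes that Weka's correlation coefficient, mean absolute error and root mean squared error follow the usual textbook definitions. The helper name `summarize` is made up for illustration.

```python
import math

# Actual and predicted yields copied from the MultilayerPerceptron test-set table.
actual = [2.551, 2.126, 2.6, 1.901, 2.146, 2.533, 2.54, 2.82]
predicted = [2.36, 3.079, 1.319, 3.539, 3.635, 2.917, 2.744, 3.473]


def summarize(actual, predicted):
    """Return (correlation, MAE, RMSE) for two paired lists (hypothetical helper)."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Pearson correlation between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p)
    return corr, mae, rmse


corr, mae, rmse = summarize(actual, predicted)
print(f"correlation={corr:.4f}  MAE={mae:.4f}  RMSE={rmse:.4f}")
```

The results agree with the Weka summary to about three decimal places; the small differences come from the rounding in the printed table. The negative correlation means the predictions tend to move in the opposite direction to the actual yields on these eight instances.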

In the case of SVM for regression:

inst#   actual   predicted   error
1       2.551    2.538       -0.013
2       2.126    2.568        0.442
3       2.6      2.335       -0.265
4       1.901    2.556        0.655
5       2.146    2.632        0.486
6       2.533    2.24        -0.293
7       2.54     2.766        0.226
8       2.82     3.175        0.355

=== Evaluation on test set ===
=== Summary ===

Correlation coefficient          0.2888
Mean absolute error              0.3417
Root mean squared error          0.3862
Relative absolute error         58.0331 %
Root relative squared error     58.9028 %
Total Number of Instances        8
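One way to read the "Relative absolute error" lines is as the model's mean absolute error divided by the error of a trivial baseline that always predicts the mean yield. The sketch below assumes that definition (it is my understanding of how Weka computes it, not something stated in the output) and uses only the numbers reported in the two summaries. The function name `implied_baseline_mae` is hypothetical.

```python
# Values copied from the two Weka summaries in the question.
reported = {
    "MultilayerPerceptron": {"mae": 0.8493, "rae_percent": 144.2248},
    "SVM regression": {"mae": 0.3417, "rae_percent": 58.0331},
}


def implied_baseline_mae(mae, rae_percent):
    """Back out the baseline error, assuming RAE = 100 * MAE_model / MAE_baseline."""
    return mae / (rae_percent / 100.0)


baselines = {
    name: implied_baseline_mae(r["mae"], r["rae_percent"])
    for name, r in reported.items()
}

for name, baseline in baselines.items():
    verdict = "worse" if reported[name]["rae_percent"] > 100 else "better"
    print(f"{name}: implied baseline MAE = {baseline:.3f} ({verdict} than the baseline)")
```

Both runs imply the same baseline error of roughly 0.589 tons/hectare, which is consistent with the assumed definition. Under that reading, a value above 100 % (the MultilayerPerceptron run) means the model does worse than simply predicting the mean, while the SVM run does somewhat better.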

What could be wrong with my setup? Please let me know. Thanks!

You'll have to give us some more information. What are the attributes that you are using (type, range)? What exactly are you trying to predict? Also, in general you'll need a dataset much larger than 36 to achieve good results. — Lars Kotthoff
I am basically using year(1),annual_rainfall(1), max_temperature(4),min_temperature(4),solar_radiation(4). Max_temperature,min_temperature and solar radiations are each 4 factors as I am considering only fruit season... i.e. nov-dec, dec-jan, jan-feb, feb-mar. So each becomes a factor. All are numeric data with range as follows: — solvesak
year: 1973-74 to 2008-2009 annual rainfall: 365mm to 1617mm radiation around 11 to 20 ... max and min temperature in degrees centigrade. yield is in tons/hectare. — solvesak
Do I need to normalize the data ? I guess it is being done by WEKA classifiers. — solvesak
You might have better luck with discretising the prediction e.g. into low/medium/high yield. — Lars Kotthoff

Rushdi Shams · Accepted Answer · 2012-04-08T22:30:34

Do I need to normalize the data ? I guess it is being done by WEKA classifiers.

If you want to normalize the data, you have to do it yourself: in the Preprocess tab, click Choose under Filter, find the Normalize filter, and then click Apply.
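To make clear what normalization does to an attribute, here is a minimal plain-Python sketch of min-max scaling to the range [0, 1]. It assumes this is what the Normalize filter does with its default settings; it is an illustration, not Weka's implementation. The rainfall endpoints come from the question, and the middle value is made up.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# 365 and 1617 are the rainfall range given in the question;
# 1000.0 is a hypothetical value added for illustration.
rainfall_mm = [365.0, 1000.0, 1617.0]
scaled = min_max_normalize(rainfall_mm)
print(scaled)
```

After scaling, rainfall (hundreds of millimetres) and radiation (around 11 to 20) live on the same range, so no attribute dominates purely because of its units.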

If you want to discretize your data, you have to follow the same process.

You might have better luck with discretising the prediction e.g. into low/medium/high yield.

Whether you need to normalize or discretize cannot be decided from your data alone or from a single run. For instance, discretization gives better results for naive Bayes classifiers. For SVM, I am not sure.
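As an illustration of the low/medium/high suggestion, the sketch below bins the eight actual test-set yields into three equal-width intervals. Equal-width binning is only one possible choice (equal-frequency binning is another), and the function name `equal_width_labels` is made up; this is plain Python, not the Weka Discretize filter.

```python
def equal_width_labels(values, labels=("low", "medium", "high")):
    """Assign each value to one of len(labels) equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    n_bins = len(labels)
    width = (hi - lo) / n_bins
    if width == 0:
        return [labels[0]] * len(values)
    result = []
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        result.append(labels[index])
    return result


# Actual yields (tons/hectare) from the test-set tables in the question.
yields = [2.551, 2.126, 2.6, 1.901, 2.146, 2.533, 2.54, 2.82]
classes = equal_width_labels(yields)
print(list(zip(yields, classes)))
```

On these eight values the result is five "high", three "low" and no "medium", which shows that with so little data the bins can end up unbalanced or empty. The bin boundaries would normally be chosen on the training data, not on the test set.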

I did not see Precision, Recall or F-score for your data. But since you say you get bad results on the test set, it is quite possible that your classifier is overfitting. Try to increase the number of training instances (36 is too few, I guess). Keep us posted on what happens when you increase the training instances.
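A rough count of the network's free parameters shows why overfitting is plausible here. The sketch assumes that the `-H a` option in the question's scheme means a single hidden layer with (attributes + classes) / 2 nodes, that a numeric target counts as one class, and that every node has one bias weight. These are my reading of the MultilayerPerceptron options, not something shown in the output.

```python
# From the question: 15 attributes in total, one of which is the yield to predict.
total_attributes = 15
n_inputs = total_attributes - 1   # 14 predictors (year, rainfall, 4 + 4 + 4 seasonal factors)
n_outputs = 1                     # numeric yield, assumed to count as one "class"
n_training_instances = 28

# Assumed meaning of "-H a": one hidden layer with (attributes + classes) / 2 nodes.
n_hidden = (n_inputs + n_outputs) // 2

# Each hidden node has one weight per input plus a bias;
# each output node has one weight per hidden node plus a bias.
n_weights = (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

print(f"hidden nodes: {n_hidden}, weights: {n_weights}, "
      f"training instances: {n_training_instances}")
```

Under these assumptions the network has 113 adjustable weights but only 28 training instances, so it can fit the training data closely without learning anything that carries over to the 8 test instances.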
