I'm using the latest Weka developer version and I'm trying to create my first SVM models. My data is unbalanced. I want to create 10 models. Each model has to learn from 100 positive samples (one class) against 900 negative samples from the other classes (each class is unique), stored in Instances (one Instances set for training a single model).

I'm new to this, so I've been looking at an SVM algorithm named RegSMOImproved from the weka.classifiers.functions.supportVector package. I assume this is an SVM variant that uses regression.

I have already created 10 Instances sets and filled them with DenseInstance objects, so the sets are ready for training.

1. Is this (RegSMOImproved) a good algorithm for my problem?

2. If 1. is a YES: are there any options I need to set before training?

3. If 1. is a NO: what would be the correct choice for my problem, using the weka library?

If I left anything out, please leave a comment below and I will gladly add more info asap.

Thank You in advance.

EDIT 1:

So the main concern is my data. OK. I apologize for my bad English in advance. Will try and keep it as simple as possible.

Each class I want to classify is a plant species (unique). I have 10 species (meaning 10 classes). I've collected 100 samples of each species and taken photos of their flowers. Each flower sample has a size, a symmetry and a texture. I have described each flower with an array of features, segmenting the ROI and extracting texture features from relatively positioned points of interest. A feature array describes one sample flower as a whole (ordered features for every sample). Each feature array has exactly 100,000 texture features (type = double).

The 1000 samples (10 classes) will have to produce 10 models, and each model should be trained on 100 positive samples against 900 negative samples, i.e. one class against all the others.

Class A: {100 positives, 900 negatives} 
Class B: {100 negatives, 100 positives, 800 negatives} 
Class C: {200 negatives, 100 positives, 700 negatives} 
... etc...

Class A:
Sample 1: {3.4500864,6.4582390,...}{class: positive}
etc.
Sample 101: {0.4594787, 2.543478,...}{class: negative}
etc.
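
The one-against-all layout above can be sketched in plain Java, without Weka. This is only an illustration under assumptions: the class name OneVsRestLabels, the method binaryLabels and the integer species ids 0..9 are hypothetical, and the samples are assumed to be stored species by species, 100 each.

```java
import java.util.Arrays;

public class OneVsRestLabels {
    // Returns 1 for samples of the target species and 0 for every other species.
    public static int[] binaryLabels(int[] speciesIds, int targetSpecies) {
        int[] labels = new int[speciesIds.length];
        for (int i = 0; i < speciesIds.length; i++) {
            labels[i] = (speciesIds[i] == targetSpecies) ? 1 : 0;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Assumed layout: 10 species with 100 samples each, stored in order.
        int[] speciesIds = new int[1000];
        for (int i = 0; i < speciesIds.length; i++) {
            speciesIds[i] = i / 100;
        }
        // One label vector per model: 100 positives against 900 negatives.
        for (int target = 0; target < 10; target++) {
            int[] labels = binaryLabels(speciesIds, target);
            int positives = Arrays.stream(labels).sum();
            System.out.println("species " + target + ": " + positives
                    + " positives, " + (labels.length - positives) + " negatives");
        }
    }
}
```

Each label vector would then become the class attribute of one training set, while the feature arrays stay the same across all 10 sets.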

It's really hard to compare one class to another in a multidimensional space when arrays are so long. Some of the species have similar flowers but this is not an obstacle since I am relying on textures to tell the difference (there are supposed to be some really good patterns inside each class - have seen some when I was testing my feature extraction procedures).

I want to compare a freshly taken unknown sample to each model and get a probability value between 0 and 1 from each comparison, if this is possible at all. Classes are already excluded based on flower shape and size outside the machine learning procedures, meaning the final determination based on machine learning will be made from half of the total number of models. Each model will be saved under its class name on my local hard drive, so at the end I will return 3 model names (which are the class names).
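
The final selection step described here could look roughly like the sketch below. It assumes that every model has already produced a score between 0 and 1 for the unknown sample and that the shape and size filter has produced a set of still-possible class names. TopCandidates, topK and the species names are hypothetical and not part of Weka.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TopCandidates {
    // Keeps only the allowed classes, sorts them by score (highest first)
    // and returns at most k class names.
    public static List<String> topK(Map<String, Double> scores, Set<String> allowed, int k) {
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (allowed.contains(entry.getKey())) {
                kept.add(entry);
            }
        }
        kept.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(k, kept.size()); i++) {
            names.add(kept.get(i).getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up scores, one per saved model (model name = class name).
        Map<String, Double> scores = new HashMap<>();
        scores.put("speciesA", 0.90);
        scores.put("speciesB", 0.80);
        scores.put("speciesC", 0.10);
        scores.put("speciesD", 0.50);
        scores.put("speciesE", 0.70);
        // Classes that survived the shape and size filter.
        Set<String> allowed = new HashSet<>(Arrays.asList("speciesB", "speciesC", "speciesD", "speciesE"));
        System.out.println(topK(scores, allowed, 3));
    }
}
```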

Please let me know if I unintentionally left anything out.

UPDATE:

These are the settings I've tried on my multi class classification problem and they gave me some promising results. It might be worth trying them out for any of you guys that want to make a multi class classifier with the option SMO.

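import weka.classifiers.meta.MultiClassClassifier;

MultiClassClassifier cModel = new MultiClassClassifier();
String[] options = {
        // options for MultiClassClassifier
        "-M", "0",        // method 0 = 1-against-all
        "-R", "2.0",      // width multiplier, only used with random codes
        "-S", "1",        // random seed
        "-W", "weka.classifiers.functions.SMO",  // base classifier
        "--",             // everything below is passed on to SMO
        "-C", "1",        // complexity constant
        "-L", "0.001",    // tolerance
        "-P", "1.0e-12",  // epsilon for round-off error
        "-M",             // fit logistic models to the SVM outputs
        "-N", "0",        // 0 = normalize the training data
        "-V", "10",       // folds for the internal cross-validation
        "-W", "1",        // random seed for the cross-validation
        "-K", "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
};
cModel.setOptions(options);  // throws Exception if an option is invalid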

I got this solution from here:

http://weka.8497.n7.nabble.com/meta-multi-class-classifier-with-the-option-smo-td26548.html

I've added "fitting of logistic models" with the option -M for SMO and "cross validation" with the option -V for SMO as suggested in the mailing list.

IMPORTANT: This option creates just a single trained model.

Please include the full error: message and trace-back. Can you also send sample data that would allow us to reproduce the problem? Fixing that could well be the fastest track to some solution. - Prune
Have found some errors in my code. Will try and fix them and add an edit when it's done. - c00ki3s

1 Answer

Let's see ... you have several issues here.

A. I assume [RegSMOImproved] is an SVM variant with the use of regression.

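Correct, it is regression: RegSMOImproved is an optimizer used by Weka's SMOreg (support vector regression). Strictly speaking, the process does not minimize the squared errors between observations and predictions; it penalizes only the errors that are larger than epsilon. Wikipedia gives a nice overview. Note the epsilon restriction.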

  1. Is this a good algorithm for my problem?

That depends on your data. Regression will get you a reasonable result for most applications, and will do so in reasonable time. Whether it's good depends on your definition of "good".

If your data are decently partitioned with no egregious outliers, then the various regression methods will work well -- but so will most methods. Having one or two of your positive points "far behind enemy lines" will move the SVM boundary a displeasing amount. If the infiltrators are close to the natural boundary, no problem.

  2. ... are there any options I need to set before training?

Again, it depends on the data. I recommend that you pay careful attention to that epsilon parameter, as it's a hard constraint. If you set it too small, you risk over-fitting. An appropriate value depends on your data mixture. If you have an extreme outlier, an epsilon that is otherwise reasonable can throw off the entire process.
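
To make the role of epsilon concrete, here is a minimal sketch of the standard epsilon-insensitive loss used in support vector regression. It is not Weka's internal code. The class and method names are hypothetical and the numbers are made up.

```java
public class EpsilonInsensitiveLoss {
    // Errors inside the epsilon tube cost nothing;
    // errors outside it are penalized linearly by the amount they exceed epsilon.
    public static double loss(double observed, double predicted, double epsilon) {
        return Math.max(0.0, Math.abs(observed - predicted) - epsilon);
    }

    public static void main(String[] args) {
        double observed = 1.0;
        double predicted = 0.5;
        // Wide tube: the error of 0.5 is tolerated.
        System.out.println(loss(observed, predicted, 0.75));
        // Narrow tube: the same error is now penalized.
        System.out.println(loss(observed, predicted, 0.25));
    }
}
```

With a very small epsilon almost every training point contributes to the loss, which is where the over-fitting risk comes from.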

  3. ... what would be the correct choice ...?

Again, it depends on your data. What are the shapes of the regions you want classified? How many infiltrators are there? How far off are they? What influence do you want them to have? What outcome do you need from the trainings?

For instance, if the data come in concentric spheres, you'd want either a 1-class SVM on the inner sphere alone, or perhaps a spectral clustering method.

I hope this moves you toward a solution.


Response to EDIT 1: Thanks for the added information. It sounds as if you do, indeed, have a classic case for multi-class SVM. You are correct in training each class as "one vs all": this is exactly how other packages handle the problem. You might want to look up the term. Have you considered MultiClassClassifier with the SMO argument?

Most classification algorithms compute these partial probabilities, but then return the classification with the highest value, and fail to return the relative probabilities. I've cast about a little, and I haven't found anything that returns the class propensities or labels (probability of being in each class, must sum to 1.0).
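
If you do get one score per one-vs-all model, a naive way to turn them into propensities that sum to 1.0 is to rescale them, as in the sketch below. It assumes non-negative scores and is only a rescaling, not a calibrated probability estimate. NormalizeScores and normalize are hypothetical names.

```java
public class NormalizeScores {
    // Divides every score by the total so the result sums to 1.0.
    // Falls back to a uniform distribution if all scores are zero.
    public static double[] normalize(double[] scores) {
        double total = 0.0;
        for (double s : scores) {
            total += s;
        }
        double[] result = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            result[i] = (total > 0.0) ? scores[i] / total : 1.0 / scores.length;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up outputs of three one-vs-all models for one unknown sample.
        double[] propensities = normalize(new double[] {0.6, 0.3, 0.3});
        for (double p : propensities) {
            System.out.println(p);
        }
    }
}
```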

You could do this with a label propagation graph: define a similarity metric, build a graph with your data points, label each point with its observed species (one 1.0 value and nine 0.0 values), and use neighbouring similarities for edge weights. However, I think this is likely to require too much memory overhead for your application.
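
For reference, a bare-bones version of that idea might look like the sketch below. It assumes a dense similarity matrix is already computed, which is exactly the memory cost mentioned above. Labelled points are clamped to their one-hot vector and unlabelled points repeatedly take the weighted average of their neighbours. The class name, method name and the tiny three-point graph are hypothetical.

```java
public class LabelPropagationSketch {
    // weights[i][j]: non-negative similarity between points i and j.
    // labels[i]: class index of a labelled point, or -1 if the point is unlabelled.
    // Returns one row of class propensities per point.
    public static double[][] propagate(double[][] weights, int[] labels,
                                       int numClasses, int iterations) {
        int n = labels.length;
        double[][] current = new double[n][numClasses];
        for (int i = 0; i < n; i++) {
            if (labels[i] >= 0) {
                current[i][labels[i]] = 1.0;  // one-hot start for labelled points
            }
        }
        for (int it = 0; it < iterations; it++) {
            double[][] next = new double[n][numClasses];
            for (int i = 0; i < n; i++) {
                if (labels[i] >= 0) {
                    next[i] = current[i].clone();  // labelled points stay clamped
                    continue;
                }
                double degree = 0.0;
                for (int j = 0; j < n; j++) {
                    if (j == i) {
                        continue;
                    }
                    degree += weights[i][j];
                    for (int c = 0; c < numClasses; c++) {
                        next[i][c] += weights[i][j] * current[j][c];
                    }
                }
                if (degree > 0.0) {
                    for (int c = 0; c < numClasses; c++) {
                        next[i][c] /= degree;  // weighted average of the neighbours
                    }
                }
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Point 0 is species 0, point 2 is species 1, point 1 is unknown
        // and is much more similar to point 0 than to point 2.
        double[][] weights = {
            {0.0, 0.9, 0.0},
            {0.9, 0.0, 0.1},
            {0.0, 0.1, 0.0}
        };
        int[] labels = {0, -1, 1};
        double[][] result = propagate(weights, labels, 2, 10);
        System.out.println(result[1][0] + " " + result[1][1]);
    }
}
```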