I'm using the latest Weka developer version from here and I'm trying to create my first SVM models. My data is unbalanced. I want to create 10 models. Each model has to learn from 100 positive samples (one class) against 900 negative samples (each class is unique), stored in Instances (one Instances set for training a single model).
I'm new to this, so I've been looking at an SVM algorithm named RegSMOImproved from the weka.classifiers.functions.supportVector package. I assume this is an SVM variant that uses regression.
I have already created 10 Instances sets and filled them with DenseInstance objects, so the sets are ready for training.
1. Is this (RegSMOImproved) a good algorithm for my problem?
2. If 1. is a YES: are there any options I need to set before training?
3. If 1. is a NO: what would be the correct choice for my problem, using the weka library?
If I left anything out, please leave a comment below and I will gladly add more info asap.
Thank You in advance.
EDIT 1:
So the main concern is my data. OK. I apologize for my bad English in advance. Will try and keep it as simple as possible.
Each class I want to classify is a plant species (unique). I have 10 species (meaning 10 classes). I've collected 100 samples of each species and photographed their flowers. Each flower sample has a size, a symmetry and a texture. I have described each flower with an array of features, segmenting the ROI and extracting texture features from relatively positioned points of interest. A feature array describes one sample flower as a whole (ordered features for every sample). Each feature array has exactly 100,000 texture features (type = double).
The 1000 samples (10 classes) will have to produce 10 models, and each model should be trained on 100 positive samples against 900 negative samples, i.e. one class against all the others.
Class A: {100 positives, 900 negatives}
Class B: {100 negatives, 100 positives, 800 negatives}
Class C: {200 negatives, 100 positives, 700 negatives}
... etc...
Class A:
Sample 1: {3.4500864,6.4582390,...}{class: positive}
etc.
Sample 101: {0.4594787, 2.543478,...}{class: negative}
etc.
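A minimal sketch of the one-against-all labelling described above, assuming the 1000 samples are stored grouped by species (100 in a row per species). The class and method names are hypothetical and only illustrate the layout; this is plain Java and does not use the Weka API.

```java
import java.util.Arrays;

class OneVsRestLabels {
    /** Assumed layout: samples grouped by species, samplesPerClass in a row. */
    public static int[] groupedSpecies(int numClasses, int samplesPerClass) {
        int[] species = new int[numClasses * samplesPerClass];
        for (int i = 0; i < species.length; i++) {
            species[i] = i / samplesPerClass; // samples 0..99 -> species 0, 100..199 -> species 1, ...
        }
        return species;
    }

    /** Returns 1 for samples of targetClass (positives) and 0 for all others (negatives). */
    public static int[] binaryLabels(int[] speciesOfSample, int targetClass) {
        int[] labels = new int[speciesOfSample.length];
        for (int i = 0; i < speciesOfSample.length; i++) {
            labels[i] = (speciesOfSample[i] == targetClass) ? 1 : 0;
        }
        return labels;
    }

    public static void main(String[] args) {
        int[] species = groupedSpecies(10, 100);
        // Class B is species index 1: 100 negatives, 100 positives, 800 negatives.
        int[] classB = binaryLabels(species, 1);
        System.out.println("Positives for class B: " + Arrays.stream(classB).sum());
    }
}
```

Each of the 10 label arrays would then become the class attribute of one training set.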
It's really hard to compare one class to another in a multidimensional space when arrays are so long. Some of the species have similar flowers but this is not an obstacle since I am relying on textures to tell the difference (there are supposed to be some really good patterns inside each class - have seen some when I was testing my feature extraction procedures).
I want to compare a freshly taken unknown sample to each model and get back a probability value between 0 and 1 from each comparison, if this is possible at all. My determination procedure already excludes classes based on flower shape and size, outside of the machine learning procedures, so the final machine-learning-based determination will be made using about half of the total number of models. Each model will be saved under its class name on my local hard drive, so at the end I will return 3 model names (which are the class names).
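The final step described above (keep only the classes that survived the shape and size filter, then return the 3 best model names) could look roughly like this. It is a sketch in plain Java: the class name, the method name and the example scores are made up for illustration, and the per-model probabilities are assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopCandidates {
    /** Returns up to k class names with the highest scores, best first, restricted to `allowed`. */
    public static List<String> topK(Map<String, Double> scoreByClass, Set<String> allowed, int k) {
        List<Map.Entry<String, Double>> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : scoreByClass.entrySet()) {
            if (allowed.contains(e.getKey())) {
                kept.add(e); // class survived the shape/size pre-filter
            }
        }
        // Sort by score, highest first.
        kept.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(k, kept.size()); i++) {
            names.add(kept.get(i).getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up scores, one per saved model (model name = class name).
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("A", 0.10);
        scores.put("B", 0.95);
        scores.put("C", 0.40);
        scores.put("D", 0.70);
        scores.put("E", 0.05);
        // B was excluded by the pre-filter in this example.
        Set<String> allowed = new HashSet<>(Arrays.asList("A", "C", "D", "E"));
        System.out.println(topK(scores, allowed, 3));
    }
}
```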
Please let me know if I unintentionally left anything out.
UPDATE:
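These are the settings I've tried on my multi-class classification problem, and they gave me some promising results. They might be worth trying for anyone who wants to build a multi-class classifier with SMO as the base classifier.
    import weka.classifiers.meta.MultiClassClassifier;

    MultiClassClassifier cModel = new MultiClassClassifier();
    String[] options = {
        "-M", "0",        // MultiClassClassifier: method 0 = 1-against-all
        "-R", "2.0",      // MultiClassClassifier: random width factor
        "-S", "1",        // MultiClassClassifier: random seed
        "-W", "weka.classifiers.functions.SMO",  // base classifier
        "--",             // everything after this is passed to SMO
        "-C", "1",        // SMO: complexity constant
        "-L", "0.001",    // SMO: tolerance
        "-P", "1.0e-12",  // SMO: epsilon for round-off error
        "-M",             // SMO: fit logistic models to the SVM outputs
        "-N", "0",        // SMO: 0 = normalize the training data
        "-V", "10",       // SMO: folds for the internal cross-validation
        "-W", "1",        // SMO: random seed
        "-K", "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
    };
    cModel.setOptions(options);  // apply the options (setOptions throws Exception)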
I got this solution from here:
http://weka.8497.n7.nabble.com/meta-multi-class-classifier-with-the-option-smo-td26548.html
I've added "fitting of logistic models" with the option -M for SMO and "cross validation" with the option -V for SMO as suggested in the mailing list.
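To illustrate in general terms what fitting a logistic model to SVM outputs means: the raw decision value of the SVM is pushed through a logistic curve so that the result lies between 0 and 1. The sketch below is plain Java, not Weka's internal code; the class name is hypothetical, and slope and intercept are placeholder parameters that would normally be fitted from the SVM outputs on held-out or cross-validated data.

```java
class LogisticCalibration {
    /**
     * Maps a raw SVM decision value to a value in (0, 1) using a logistic curve.
     * slope and intercept are assumed to have been fitted elsewhere.
     */
    public static double toProbability(double decisionValue, double slope, double intercept) {
        return 1.0 / (1.0 + Math.exp(-(slope * decisionValue + intercept)));
    }

    public static void main(String[] args) {
        double slope = 2.0;     // hypothetical fitted value
        double intercept = 0.0; // hypothetical fitted value
        for (double f : new double[] {-3.0, 0.0, 3.0}) {
            System.out.println("f = " + f + " -> p = " + toProbability(f, slope, intercept));
        }
    }
}
```

With a positive slope, samples far on the positive side of the decision boundary get values close to 1 and samples far on the negative side get values close to 0.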
IMPORTANT: This approach creates just a single trained model (not 10 separate ones).