3
votes

I'd like to classify a set of 3D images (MRI). There are 4 classes (i.e. grades of a disease: A, B, C, D), and the distinction between the 4 grades is not trivial. Therefore the labels I have for the training data are not one class per image. Each label is a set of 4 probabilities, one per class, e.g.

0.7   0.1  0.05  0.15
0.35  0.2  0.45  0.0
...

... would basically mean that

  • The first image belongs to class A with a probability of 70%, class B with 10%, C with 5% and D with 15%
  • etc., I'm sure you get the idea.

I don't understand how to fit a model with these labels, because scikit-learn classifiers expect exactly one label per training sample. Using just the class with the highest probability gives miserable results.
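For reference, the reduction to the highest-probability class mentioned above can be sketched roughly as follows. The names `soft_labels`, `classes` and `hard_labels` and the use of NumPy are assumptions for illustration; the two rows are the example labels from above.

```python
import numpy as np

# Soft labels from the example above: one row per image,
# columns are the probabilities for classes A, B, C, D.
soft_labels = np.array([
    [0.70, 0.10, 0.05, 0.15],
    [0.35, 0.20, 0.45, 0.00],
])
classes = np.array(["A", "B", "C", "D"])

# Keep only the most probable class per image.
# All other probabilities are thrown away at this point.
hard_labels = classes[soft_labels.argmax(axis=1)]
print(hard_labels)
```

The second image shows what gets lost: it ends up as a plain "C" even though A (35%) is nearly as likely as C (45%).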

Can I train my model with scikit-learn multilabel classification (and how)?

Please note:

  • Feature extraction is not the problem.
  • Prediction is not the problem.
Is your intent to predict the classification of an image into one of the four classes, or rather to "search" the probabilities defined elsewhere? What is your input data: the image data itself, or some meta information on the images? Where do the probabilities (the labels on your data) originate from? – miraculixx
I tried to formulate the setting in general terms, hoping that this would make it easier. But if it really helps, I can make it more concrete: the input data are 3D scans of brains, but my problem is not how to calculate the relevant features (you call it meta information). The intention is to predict (four) probabilities that a 3D MRI belongs to (disease) class A, B, C and D. The distinction between the 4 classes is not trivial, therefore my labels are only probabilities (assigned by doctors). The four probabilities sum to 1.0. – texasWINthem
Is there a unique and correct/best assignment of labels per image? It seems to me that in calculating probabilities for the four classes and using these as labels, you are essentially doing the work of the classifier. If you can use the classes A, B, C, D as labels, the predict_proba method will return a probability for each class for any given new input. – miraculixx
You should probably pass these probabilities as additional features along with the highest-probability class labels, and then check whether the results of predict_proba change. Anyway, as it's defined now, the question is not suitable for Stack Overflow. Please post it on stats.stackexchange.com. – Vivek Kumar
How are the probabilities that you want to use as labels derived? Also, you state that prediction is not the problem. Maybe you don't need a machine learning algorithm but a search algorithm? – miraculixx

1 Answer

-1
votes

Can I handle this somehow with the multilabel classification framework?

For predict_proba to return the probability for each class A, B, C, D the classifier needs to be trained with one label per image.

If yes: How?

Use the image class as the label (Y) in your training set. That is, your input dataset will look something like this:

F1  F2  F3  F4  Y

1   0   1   0   A
0   1   1   1   B
1   0   0   0   C
0   0   0   1   D
(...)

where F# are the features of each image and Y is the class assigned by the doctors.
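A minimal sketch of this setup, using the four example rows above as training data. The choice of `LogisticRegression` is an assumption for illustration (any scikit-learn classifier that offers `predict_proba` could be substituted), and the new sample `x_new` is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features F1..F4 and class labels Y, copied from the example table.
X = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
])
y = np.array(["A", "B", "C", "D"])

# Assumed classifier; one hard label per training image.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Hypothetical new image, described by the same four features.
x_new = np.array([[1, 0, 1, 1]])
proba = clf.predict_proba(x_new)

# Columns of `proba` follow the order of clf.classes_.
print(clf.classes_)
print(proba)
```

Each row returned by `predict_proba` contains one probability per class and sums to 1.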

If no: Any other approaches?

If you have more than one label per image, that is, multiple potential classes or their respective probabilities, multilabel models might be a more appropriate choice, as documented in the scikit-learn user guide under "Multiclass and multilabel algorithms".
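If the probabilities themselves should drive the training, one commonly used workaround is sketched below. It is a different technique from the multilabel framework mentioned above, not something prescribed by the scikit-learn documentation for this case. Each image is repeated once per class and the doctor-assigned probability is passed as `sample_weight`. The data here are random placeholders, and `LogisticRegression` as well as all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data, NOT real measurements:
# X holds the features, P holds soft labels (each row sums to 1).
rng = np.random.default_rng(0)
n_images, n_features, n_classes = 20, 4, 4
X = rng.normal(size=(n_images, n_features))
P = rng.dirichlet(np.ones(n_classes), size=n_images)

# Repeat every image once per class: x0, x0, x0, x0, x1, x1, ...
X_rep = np.repeat(X, n_classes, axis=0)
# Matching class indices: 0, 1, 2, 3, 0, 1, 2, 3, ...
y_rep = np.tile(np.arange(n_classes), n_images)
# Weight of the pair (image i, class k) is P[i, k] (row-major order).
w_rep = P.ravel()

clf = LogisticRegression(max_iter=1000)
clf.fit(X_rep, y_rep, sample_weight=w_rep)

# Predicted probabilities for the first two images.
pred = clf.predict_proba(X[:2])
print(pred)
```

With these weights, the log-loss minimised by the classifier equals the cross-entropy against the soft labels (plus the usual regularisation term), so no probability mass is discarded as it is when only the most likely class is kept.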