This question is about multilabel and multiclass classification in the context of clustering tasks. Here are definitions of the two, to make sure no one confuses them:
Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.
From this definition of multilabel classification, we can understand that a sample can have multiple true binary labels, so a text that is about religion and politics would have a target that looks like this: y = [1, 1, 0, 0] (with classes ordered as religion, politics, finance, education).
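To make the target concrete, here is a minimal sketch (the class order religion, politics, finance, education is my assumption) of building such a binary target vector:

```python
# Assumed class order, matching the y = [1, 1, 0, 0] example above.
classes = ["religion", "politics", "finance", "education"]

def binarize(topics):
    """Multilabel binary target: 1 if the class applies to the sample, else 0."""
    return [1 if c in topics else 0 for c in classes]

y = binarize({"religion", "politics"})
print(y)  # [1, 1, 0, 0]
```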
What if instead of having binary labels, we had probabilities or scores? Our target would then look like this instead: y = [0.5, 0.4, 0.0, 0.1], where the probabilities sum to 1. The document is 50% religion, 40% politics and 10% education. Of course, labelling datasets like this by hand is not really feasible, so let's look at another family of tasks, more precisely clustering tasks, to see how such targets could arise.
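As a small sketch (class order assumed as above, per-class scores purely illustrative), such a soft target is just a normalized score vector:

```python
# Hypothetical per-class scores for one document; numbers are illustrative.
raw = {"religion": 0.5, "politics": 0.4, "finance": 0.0, "education": 0.1}
classes = ["religion", "politics", "finance", "education"]

total = sum(raw.values())
y = [raw[c] / total for c in classes]  # normalize so the target sums to 1
print(y)  # [0.5, 0.4, 0.0, 0.1]
```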
For clustering tasks, say we have a dataset [a, b, c, d, e] and its set of true clusters [abce, d]. Clustering could be seen as a classification task where the classes are a set of actions: merge with an active cluster or start a new one. Imagine a system that incrementally builds these clusters. It will of course make mistakes, producing incoherent clusters such as [ab, c, d] along the way.
in the process. When looking at the next sample e
, it is now impossible to tell exactly which cluster it should be added to because its true cluster is now divided in two. Since we know the set of correct clusters, we could assign each action (or potential merge) a precision- or recall-based score y = [0.5, 0.3, 0, 0.2]
(these numbers are the result of my imagination, not precision nor recall). So what is our label here? Should we merge with any of these clusters or should we start a new cluster containing only e
?
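One way such scores could be computed is sketched below, using the F1 of the cluster each action would produce, measured against e's true cluster (F1 is my choice here for the sake of the example, and the resulting numbers differ from the imagined [0.5, 0.3, 0, 0.2]):

```python
# Gold cluster containing e, and the candidate actions: merge e into one of
# the current clusters [ab, c, d], or (last entry) start a new cluster.
gold_of_e = {"a", "b", "c", "e"}
candidates = [{"a", "b"}, {"c"}, {"d"}, set()]

def f1(result, gold):
    """F1 of the cluster an action would produce, against e's gold cluster."""
    overlap = len(result & gold)
    p, r = overlap / len(result), overlap / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

raw = [f1(c | {"e"}, gold_of_e) for c in candidates]
y = [s / sum(raw) for s in raw]  # normalize into a soft target
# merging with ab scores highest; merging with c beats merging with d
print(y)
```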
An easy solution would be to take the highest score as our true label (or latent action, for lack of a better term) and use normal classification cost functions. This would mean that the latent action merge e->ab is the only true answer and everything else is equally bad. In my opinion this seems wrong, because the actions merge e->c and merge e->d would be penalized the same way even though the former is not necessarily wrong.
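This intuition can be checked numerically with cross-entropy: against a hard argmax target, two confident-but-wrong predictions cost exactly the same, while against the soft score vector the (arguably worse) merge with d costs more. The numbers below are the imagined scores from above, with index order [ab, c, d, new] assumed:

```python
import math

scores = [0.5, 0.3, 0.0, 0.2]  # imagined soft scores, order [ab, c, d, new]
hard = [1, 0, 0, 0]            # argmax target: merge e->ab is the only truth

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy of predicted probabilities against a (soft or hard) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

pred_c = [0.05, 0.90, 0.02, 0.03]  # model confidently says merge e->c
pred_d = [0.05, 0.02, 0.90, 0.03]  # model confidently says merge e->d

# Hard target: both mistakes are penalized identically.
print(cross_entropy(hard, pred_c), cross_entropy(hard, pred_d))
# Soft target: the merge with d is penalized more.
print(cross_entropy(scores, pred_c) < cross_entropy(scores, pred_d))  # True
```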
Going back to multilabel classification: are there any cost functions out there that allow for such "weighted labels" instead of 1s and 0s? Or am I looking at this from the wrong angle?