4
votes

This question is about multilabel and multiclass classification applied to clustering tasks. Here is a nice definition of the two to make sure no one confuses them:

Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.

Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these.

From this definition of multilabel classification, we can understand that a sample can have multiple true binary labels, so a sample text that is about religion and politics would have a target looking like this: y = [1,1,0,0].
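For binary multilabel targets like this, the standard cost function treats each label as an independent binary decision. A minimal numpy sketch (the label order religion/politics/finance/education and the predicted probabilities are just illustrative):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # Independent binary cross-entropy per label, averaged over labels.
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

y = np.array([1, 1, 0, 0])            # religion, politics, finance, education
good = np.array([0.9, 0.8, 0.1, 0.2]) # predictions close to the true labels
bad = np.array([0.2, 0.3, 0.7, 0.6])  # predictions far from the true labels

print(binary_cross_entropy(y, good))  # small loss
print(binary_cross_entropy(y, bad))   # larger loss
```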

What if, instead of binary labels, we had probabilities or scores? Our target would then look like this: y = [0.5, 0.4, 0.0, 0.1], where, for example, the probabilities sum to 1. The document is 50% religion, 40% politics and 10% education. Of course, labelling datasets like this is not really feasible, so let's look at another set of tasks, more precisely clustering tasks, in order to see how this could happen.
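When the target is itself a probability distribution, one natural cost function is the KL divergence, which is zero exactly when the predicted distribution matches the soft target. A small sketch (the target vector is the one from the paragraph above; the masking of zero-probability labels is one common convention):

```python
import numpy as np

def kl_divergence(y, p, eps=1e-12):
    # D_KL(y || p): zero iff the prediction p matches the soft target y.
    mask = y > 0  # labels with zero target probability contribute nothing
    return np.sum(y[mask] * np.log(y[mask] / (p[mask] + eps)))

y = np.array([0.5, 0.4, 0.0, 0.1])  # religion, politics, finance, education

print(kl_divergence(y, y))                    # ~0: perfect prediction
print(kl_divergence(y, np.full(4, 0.25)))     # > 0: uniform guess is penalized
```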

For clustering tasks, we have a dataset [a,b,c,d,e] and its set of true clusters [abce,d]. Clustering could be seen as a classification task where the classes are a set of actions: merge with an active cluster or start a new one. Imagine a system that incrementally builds these clusters. It will of course make mistakes, producing incoherent clusters [ab,c,d] along the way. When looking at the next sample e, it is now impossible to tell exactly which cluster it should be added to, because its true cluster is now divided in two. Since we know the set of correct clusters, we could assign each action (or potential merge) a precision- or recall-based score, e.g. y = [0.5, 0.3, 0, 0.2] (these numbers are made up, not actual precision or recall values). So what is our label here? Should we merge with any of these clusters, or should we start a new cluster containing only e?
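One concrete way to turn the gold clustering into such a score vector is to measure, for each candidate merge, the precision of the links it would create. A sketch under the setup above (the scoring rule here is just one illustrative choice, not the only possibility):

```python
import numpy as np

# Gold clustering: {a,b,c,e} and {d}; current (incoherent) state: ab, c, d
gold = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 0}  # item -> gold cluster id
clusters = [['a', 'b'], ['c'], ['d']]
item = 'e'

# Precision of each candidate merge: fraction of the cluster's members that
# share e's gold cluster.
raw = [sum(gold[x] == gold[item] for x in c) / len(c) for c in clusters]
# "Start a new cluster" scores 0 here, because e's gold cluster is non-singleton.
raw.append(0.0)

scores = np.array(raw) / np.sum(raw)  # normalise into a soft target
print(scores)  # merge e->ab and e->c both look right; e->d and "new" do not
```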

An easy solution would be to take the highest score as our true label (or "latent action", for lack of a better term) and use normal classification cost functions. This would mean that our latent action merge e->ab is the only true answer and everything else is equally bad. In my opinion this seems wrong, because the actions merge e->c and merge e->d would be penalized the same way even though the former is not necessarily wrong.
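This objection can be checked numerically: under a one-hot target, cross-entropy only looks at the probability assigned to the argmax class, so it cannot distinguish a model that leaks probability onto the "not wrong" action e->c from one that leaks it onto the clearly wrong e->d. A soft target does distinguish them. A sketch (the action order merge e->ab, merge e->c, merge e->d, new-cluster and the prediction vectors are assumptions for illustration):

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    # H(target, pred); works for both one-hot and soft targets.
    return -np.sum(target * np.log(pred + eps))

soft = np.array([0.5, 0.3, 0.0, 0.2])  # scores for e->ab, e->c, e->d, new cluster
hard = np.zeros(4)
hard[np.argmax(soft)] = 1.0            # argmax label: [1, 0, 0, 0]

pred_c = np.array([0.6, 0.3, 0.05, 0.05])  # leaks probability onto e->c
pred_d = np.array([0.6, 0.05, 0.3, 0.05])  # leaks probability onto e->d

# Under the hard label, both predictions get the same loss...
print(cross_entropy(hard, pred_c), cross_entropy(hard, pred_d))
# ...under the soft label, favouring e->c is (correctly) penalized less.
print(cross_entropy(soft, pred_c), cross_entropy(soft, pred_d))
```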

Going back to multilabel classification: are there any cost functions out there that allow for such "weighted labels" instead of 1s and 0s? Or am I looking at this from the wrong angle?

1
In clustering, you do not have the correct labels. If you have labels, it is classification! – Has QUIT--Anony-Mousse
Hmm, I might not be using the correct terminology then. For example, in coreference resolution you do have the correct coreference chains or "clusters". I tried being as general as possible. – Fabrice Dugas

1 Answer

5
votes

I'm actually working on a PhD close to this topic, trying to come up with a sensible clustering approach for the output space. For now I've tried to use community detection approaches from network science to cluster the spaces - you can check my paper about data-driven label space division in multi-label classification for some hints. I construct a weighted or unweighted graph based on label co-occurrence in the training data and use a variety of community detection algorithms to come up with a division - then classify in each cluster and merge the results.

The weighted graph approach is somewhat related to your question, as the relations between labels are weighted by the number of documents they co-occur in.
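The weighted co-occurrence graph itself is easy to sketch with plain numpy (this is a toy illustration of the idea, not the scikit-multilearn API; the label matrix is made up):

```python
import numpy as np

# Toy binary label matrix: rows = documents, columns = labels
# (religion, politics, finance, education)
Y = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
])

# Weighted adjacency: W[i, j] = number of documents where labels i and j co-occur.
W = Y.T @ Y
np.fill_diagonal(W, 0)  # drop self-edges
print(W)
```

A community detection algorithm (e.g. Louvain on this adjacency matrix) would then partition the labels into clusters for per-cluster classification.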

I am also providing my implementation as part of the python scikit-multilearn package - you can try playing with it - implementing a new clustering approach is easy and documented. Tell me if you come up with something; I hope I helped a little.