4 votes

I have a classification problem roughly described as follows: At work we have issue tracking software that is used for much of our internal communication. When you need help from another team, for example, you file an issue in this software. Each issue can be assigned one or more tags.

For example an issue might be filed for a new hire who is having their laptop setup for the first time entitled "Laptop setup for John Smith" and tagged "Laptop issues" and "New hire onboarding." So there can be multiple tags for a given issue.

I'm trying to build a classifier that takes a title of an issue and provides a list of suggested tags. I was asked by my supervisor to do this using the Naive Bayes algorithm, so that is what I am trying. I am using scikit-learn.

First of all, is it accurate to say that this is a "multilabel" classification task as described in the scikit-learn documentation (http://scikit-learn.org/stable/modules/multiclass.html)? That's what I think, but I don't quite understand the description of "Multioutput-multiclass classification" so I wasn't able to rule that out. Again, I'm predicting one or more classes for each sample.

Second, it looks like Naive Bayes (at least in scikit-learn) doesn't natively support multilabel classification. Since I'm stuck (for now) with Naive Bayes, I figured I could roll my own multilabel classifier using the approach below. Does this seem reasonable?

  • Train one Naive Bayes binary classifier for each class (with the training data converted for each sample so that the label is 1 if the sample had that class among its various classes, and 0 otherwise).
  • Then when I need a prediction for a sample, I'll get a prediction from each binary classifier, and my overall prediction will be the tags whose binary classifiers predicted 1.
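The two steps above can be sketched roughly as follows. The titles and tag names here are made up for illustration; the real data would come from the issue tracker.

```python
# Sketch of the per-tag binary classifier idea: one MultinomialNB per tag,
# trained on a 1/0 label derived from whether the issue carries that tag.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

titles = [
    "Laptop setup for John Smith",
    "VPN access request",
    "Replace broken laptop screen",
    "Onboarding paperwork for new hire",
]
tags_per_issue = [
    {"laptop", "onboarding"},
    {"vpn"},
    {"laptop"},
    {"onboarding"},
]
all_tags = sorted({t for tags in tags_per_issue for t in tags})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

# Step 1: train one binary Naive Bayes classifier per tag.
classifiers = {}
for tag in all_tags:
    y = [1 if tag in tags else 0 for tags in tags_per_issue]
    classifiers[tag] = MultinomialNB().fit(X, y)

# Step 2: the suggested tags are those whose binary classifier says 1.
def predict_tags(title):
    x = vectorizer.transform([title])
    return {tag for tag, clf in classifiers.items() if clf.predict(x)[0] == 1}
```

With real data you would of course use far more training samples per tag; with only a handful of titles the per-tag classifiers are very noisy.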

Finally, can you think of any better approaches? The huge downside of my plan is that since there are about 2,000 tags, I would need to train 2,000 classifiers. This might not be completely prohibitive, but it isn't exactly ideal. Naive Bayes does support multiclass classification, so I wonder whether there's some way I could hack it into a single classifier (by looking at the probabilities generated for each class, if they exist).
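One possible version of that single-classifier hack, under the assumption that each title is duplicated once per tag at training time, would be to train one multiclass model and read the top-k class probabilities at prediction time. The tag names and the choice of k are illustrative; note this forces all tags to compete for probability mass (they sum to 1), unlike independent binary classifiers.

```python
# Single multiclass NB over all tags: duplicate each title once per tag,
# then suggest the k most probable tags via predict_proba.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

titles = ["Laptop setup for John Smith", "VPN access request"]
tags_per_issue = [["laptop", "onboarding"], ["vpn"]]

# Flatten so the model sees one (title, tag) pair per training row.
flat_titles, flat_tags = [], []
for title, tags in zip(titles, tags_per_issue):
    for tag in tags:
        flat_titles.append(title)
        flat_tags.append(tag)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(flat_titles)
clf = MultinomialNB().fit(X, flat_tags)

def suggest_tags(title, k=2):
    probs = clf.predict_proba(vectorizer.transform([title]))[0]
    top = np.argsort(probs)[::-1][:k]
    return [clf.classes_[i] for i in top]
```

A fixed k is crude; a probability threshold tuned per tag might suit the 2,000-tag case better, but either way the competing-probabilities caveat remains.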

1 – You should use multi-label classification. In a multioutput-multiclass task, a classifier must predict across several different output tasks, so I don't think that matches your case. Anyway, it's not supported in scikit-learn for now. – Vivek Kumar

1 Answer

3 votes

The approach you propose is valid; it is the one-vs-rest strategy generalized to multilabel classification, also known as the binary relevance method. Since you are already using scikit-learn, the functionality you want is already implemented in the sklearn.multiclass.OneVsRestClassifier class.

The only requirement is to format your labels as a binary indicator array of shape [n_samples, n_classes], which is straightforward with scikit-learn's MultiLabelBinarizer.
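A minimal sketch of this, with made-up issue titles and tags, might look like:

```python
# Binary relevance via OneVsRestClassifier: MultiLabelBinarizer converts
# the per-issue tag lists into the required [n_samples, n_classes]
# indicator array, and OneVsRestClassifier fits one NB model per tag.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

titles = [
    "Laptop setup for John Smith",
    "VPN access request",
    "Replace broken laptop screen",
    "Onboarding paperwork for new hire",
]
tags = [
    ["laptop", "onboarding"],
    ["vpn"],
    ["laptop"],
    ["onboarding"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)        # shape (n_samples, n_classes)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

clf = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

# Predict an indicator row, then map it back to tag names.
pred = clf.predict(vectorizer.transform(["Laptop setup for Jane Doe"]))
suggested = mlb.inverse_transform(pred)[0]
```

This trains the same 2,000 per-tag models under the hood, but scikit-learn manages them for you (and can fit them in parallel via the n_jobs parameter).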