I have a classification problem roughly described as follows: At work we have issue tracking software that is used for much of our internal communication. When you need help from another team, for example, you file an issue in this software. Each issue can be assigned one or more tags.
For example, an issue might be filed for a new hire who is having their laptop set up for the first time, entitled "Laptop setup for John Smith" and tagged "Laptop issues" and "New hire onboarding." So a given issue can carry multiple tags.
I'm trying to build a classifier that takes a title of an issue and provides a list of suggested tags. I was asked by my supervisor to do this using the Naive Bayes algorithm, so that is what I am trying. I am using scikit-learn.
First of all, is it accurate to say that this is a "multilabel" classification task as described in the scikit-learn documentation (http://scikit-learn.org/stable/modules/multiclass.html)? That's what I think, but I don't quite understand the description of "Multioutput-multiclass classification" so I wasn't able to rule that out. Again, I'm predicting one or more classes for each sample.
Second, it looks like Naive Bayes (at least in scikit-learn) doesn't actually support multilabel classification. Since I'm stuck (for now) with Naive Bayes, I figured I could roll my own multilabel classifier using the approach below. Does this seem reasonable?
- Train one binary Naive Bayes classifier per class (converting the training data so that each sample's label is 1 if that class appears among the sample's tags, and 0 otherwise).
- Then, when I need a prediction for a sample, I'll run every binary classifier on it, and my overall prediction will be the set of tags whose binary classifiers predicted 1.
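Here's a rough sketch of what I mean, using made-up example data (the titles, tags, and the helper name `predict_tags` are just for illustration):

```python
# Sketch of the per-tag binary-classifier plan: one MultinomialNB per tag,
# each trained on a 0/1 relabeling of the same feature matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data.
titles = [
    "Laptop setup for John Smith",
    "New hire onboarding checklist",
    "Laptop won't boot after update",
    "Badge access for new hire",
]
tags_per_issue = [
    {"Laptop issues", "New hire onboarding"},
    {"New hire onboarding"},
    {"Laptop issues"},
    {"New hire onboarding"},
]
all_tags = sorted({t for tags in tags_per_issue for t in tags})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

# One binary Naive Bayes classifier per tag.
classifiers = {}
for tag in all_tags:
    y = [1 if tag in tags else 0 for tags in tags_per_issue]
    classifiers[tag] = MultinomialNB().fit(X, y)

def predict_tags(title):
    # Collect every tag whose binary classifier says "yes" for this title.
    x = vectorizer.transform([title])
    return [tag for tag, clf in classifiers.items() if clf.predict(x)[0] == 1]

print(predict_tags("Laptop setup for new hire"))
```

(I believe `sklearn.multiclass.OneVsRestClassifier`, fed labels encoded with `MultiLabelBinarizer`, wraps exactly this one-classifier-per-tag pattern, so it might save me from hand-rolling the loop.)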
Finally, can you think of any better approaches? The huge downside of my plan is that, since there are about 2,000 tags, I would need to train 2,000 classifiers. That might not be completely prohibitive, but it isn't exactly ideal. Naive Bayes does support multiclass classification, so I wonder if there's some way I could hack it onto a single classifier (by looking at the probabilities it generates for each class, if those are available).
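To make the single-classifier idea concrete, here's a sketch of what I'm imagining: duplicate each issue once per tag it carries, train one multiclass NB on the flattened data, and then keep every class whose `predict_proba` value clears a threshold. The data, the 0.3 threshold, and the helper name `suggest_tags` are all made up:

```python
# Sketch: one multiclass MultinomialNB over (title, tag) pairs, with
# multilabel predictions recovered by thresholding predict_proba.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data.
titles = [
    "Laptop setup for John Smith",
    "New hire onboarding checklist",
    "Laptop won't boot after update",
    "Badge access for new hire",
]
tags_per_issue = [
    ["Laptop issues", "New hire onboarding"],
    ["New hire onboarding"],
    ["Laptop issues"],
    ["New hire onboarding"],
]

# Flatten: one training row per (title, tag) pair.
flat_titles, flat_tags = [], []
for title, tags in zip(titles, tags_per_issue):
    for tag in tags:
        flat_titles.append(title)
        flat_tags.append(tag)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(flat_titles)
clf = MultinomialNB().fit(X, flat_tags)

def suggest_tags(title, threshold=0.3):
    # clf.classes_ is aligned with the columns of predict_proba.
    probs = clf.predict_proba(vectorizer.transform([title]))[0]
    return [tag for tag, p in zip(clf.classes_, probs) if p >= threshold]

print(suggest_tags("Laptop setup for new hire"))
```

The threshold would obviously need tuning (or I could just return the top-k classes by probability), but this keeps it to one model instead of 2,000.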