I am trying to use scikit-learn 0.12.1 to:
- train a LogisticRegression classifier
- evaluate the classifer on held out validation data
- feed new data to this classifier and retrieve the 5 most probable labels for each observation
Sklearn makes all of this very easy except for one peculiarity. There is no guarantee that every possible label will occur in the data used to fit my classifier. There are hundreds of possible labels and some of them have not occurred in the training data available.
This results in 2 problems:
- The label vectorizer doesn't recognize previously unseen labels when they occur in the validation data. This is easily fixed by fitting the labeler to the set of possible labels but it exacerbates problem 2.
- The output of the predict_proba method of the LogisticRegression classifier is an [n_samples, n_classes] array, where n_classes consists only of the classes seen in the training data. This means running argsort on the predict_proba array no longer provides values that directly map to the label vectorizer's vocabulary.
My question is, what's the best way to force the classifier to recognize the full set of possible classes, even when some of them don't occur in the training data? Obviously it will have trouble learning about labels it has never seen data for, but 0's are perfectly useable in my situation.