I am doing some text classification. Let's say I have 10 categories and 100 "samples", where each sample is a sentence of text. I have split my samples into 80:20 (training, testing) and trained the SVM classifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1, 2))), ('tfidf', TfidfTransformer()),
('clf-svm', SGDClassifier(loss='hinge', penalty='l2', random_state=42, learning_rate='adaptive', eta0=0.9))])
# Fit the SVM pipeline on the training data
text_clf_svm = text_clf_svm.fit(training_data, training_sub_categories)
Now when it comes to predicting, I do not want just a single category to be predicted. I want to see, for example, a list of the "top 5" categories for a given unseen sample as well as their associated probabilities:
top_5_category_predictions = text_clf_svm.predict(a_single_unseen_sample)
Since text_clf_svm.predict returns a value which represents the index of the categories available, I want to see something like this as output:
[(4,0.70),(1,0.20),(7,0.04),(9,0.06)]
Anyone know how to achieve this?
predict_proba will do part of the job (i.e. not the sorting part), but it can be used only with log and modified_huber loss, and not with hinge (i.e. SVM); see the docs - desertnaut
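To illustrate the sorting part that the comment leaves open, here is a minimal sketch in plain Python. The helper name top_k_categories and the example numbers are made up for illustration and are not part of scikit-learn. It assumes you already have one row of per-category probabilities, for instance from predict_proba after refitting the pipeline with loss='modified_huber' instead of 'hinge'.

```python
from typing import Hashable, List, Sequence, Tuple


def top_k_categories(
    classes: Sequence[Hashable],
    probabilities: Sequence[float],
    k: int = 5,
) -> List[Tuple[Hashable, float]]:
    """Return the k (category, probability) pairs with the highest probability.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    if len(classes) != len(probabilities):
        raise ValueError("classes and probabilities must have the same length")
    pairs = zip(classes, probabilities)
    # Sort by probability, highest first, and keep the first k pairs.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up probabilities for 10 categories (illustration only, not model output).
example_classes = list(range(10))
example_probs = [0.0, 0.20, 0.0, 0.0, 0.70, 0.0, 0.0, 0.04, 0.0, 0.06]
print(top_k_categories(example_classes, example_probs, k=4))

# Possible use with the pipeline from the question, assuming it was refit
# with loss='modified_huber' so that predict_proba is available:
#   probs = text_clf_svm.predict_proba([a_single_unseen_sample])[0]
#   top_5 = top_k_categories(text_clf_svm.classes_, probs, k=5)
```

The categories are taken from classes_ rather than from the column position, because the columns of predict_proba follow the order of classes_. If the hinge loss has to stay, decision_function returns one score per category that can be ranked the same way, but those scores are margins and not probabilities.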