
I am doing some text classification. Let's say I have 10 categories and 100 "samples", where each sample is a sentence of text. I have split my samples 80:20 into training and testing sets and trained an SVM classifier:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1, 2))),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', random_state=42,
                                                   learning_rate='adaptive', eta0=0.9))])

# Fit the classifier on the training data
text_clf_svm = text_clf_svm.fit(training_data, training_sub_categories)

Now when it comes to predicting, I do not want just a single category to be predicted. I want to see, for example, a list of the "top 5" categories for a given unseen sample as well as their associated probabilities:

top_5_category_predictions = text_clf_svm.predict(a_single_unseen_sample)

Since text_clf_svm.predict returns only the index of the single best-scoring category, I want to see something like this as output instead:

[(4, 0.70), (1, 0.20), (9, 0.06), (7, 0.04)]

Anyone know how to achieve this?

predict_proba will do part of the job (i.e. not the sorting part), but it is only available with the log and modified_huber losses, not with hinge (i.e. SVM); see the docs - desertnaut

1 Answer


This is something I had used a while back for a similar problem:

import numpy as np

probs = clf.predict_proba(X_test)
# argsort sorts ascending, so reverse-slice to keep the n highest-probability classes
top_n_category_predictions = np.argsort(probs)[:, :-n-1:-1]

This will give you the top n categories for each sample.
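To see what that reverse slice does on its own, here is a tiny sketch on a made-up probability matrix (the values are just for illustration):

```python
import numpy as np

# Toy probability matrix: 2 samples, 5 classes (made-up values)
probs = np.array([[0.10, 0.20, 0.05, 0.60, 0.05],
                  [0.30, 0.25, 0.15, 0.10, 0.20]])
n = 3

# argsort sorts ascending; [:, :-n-1:-1] reverses each row and keeps the first n,
# i.e. the indices of the n largest probabilities, highest first
top_n = np.argsort(probs)[:, :-n-1:-1]
print(top_n)  # [[3 1 0]
              #  [0 1 4]]
```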

If you also want to see the probabilities corresponding to these categories, then you can do:

top_n_probs = np.sort(probs)[:, :-n-1:-1]

Note: Here X_test is of shape (n_samples, n_features), so make sure you pass your single unseen sample in the same 2-D format (e.g. wrapped in a list).
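Putting it all together for the original question: since hinge loss has no predict_proba, one option (as the comment suggests) is to switch to modified_huber, which keeps the SVM-like behaviour but exposes probabilities. The sketch below uses a made-up toy corpus and labels, so the exact probabilities are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 3 categories, 2 sentences each
training_data = ["cheap flights to paris", "book a hotel room",
                 "football match results", "latest league scores",
                 "stock market crash", "shares fall sharply"]
training_labels = [0, 0, 1, 1, 2, 2]

# modified_huber supports predict_proba, unlike the hinge loss in the question
clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('clf', SGDClassifier(loss='modified_huber', random_state=42))])
clf.fit(training_data, training_labels)

n = 2
sample = ["premier league football"]   # note: wrapped in a list, not a bare string
probs = clf.predict_proba(sample)      # shape (1, n_classes)

# Indices and probabilities of the top-n classes, highest first
top_n_idx = np.argsort(probs)[:, :-n-1:-1]
top_n_probs = np.sort(probs)[:, :-n-1:-1]

# Pair each class label with its probability, as in the desired output
pairs = list(zip(clf.classes_[top_n_idx[0]], top_n_probs[0]))
print(pairs)
```

The mapping through `clf.classes_` matters because the columns of `predict_proba` are ordered by `classes_`, not necessarily by the raw label values.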