I'm new to Python and scikit-learn, so please bear with me if this is a stupid question. I followed some tutorials to build a multinomial naive Bayes classifier with sklearn, and I've trained and tested it to a decent accuracy. However, I've reached the end of the tutorials and realized I don't actually know how to feed it new data to classify. Here's my code:

import sklearn as skl;
import pandas as pd;
from sklearn.metrics import accuracy_score, precision_score, recall_score;
from sklearn.model_selection import train_test_split;
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB;
from sklearn.metrics import confusion_matrix;
import matplotlib.pyplot as plt;
import seaborn as sns;
import numpy as np;

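def print_top10(vectorizer, clf):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    class_labels = clf.classes_
    for i, class_label in enumerate(class_labels):
        # feature_log_prob_[i] holds the per-feature log probabilities for class i
        top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))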

df = pd.read_excel(r'C:\Users\Nicholas\vegas700.xlsx');

#edit:
df2 = pd.read_excel(r'C:\Users\Nicholas\vegasunlabeled.xlsx');

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], random_state=11, test_size=0.25);

#edit:
finalx_train, finalx_test, finaly_train, finaly_test = train_test_split(df['text'], df['label'], random_state=1, test_size=0.99)

cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', lowercase=True, stop_words='english');

X_train_cv = cv.fit_transform(X_train.values.astype('U'));
X_test_cv = cv.transform(X_test.values.astype('U'));
#edit:
finalx_cv = cv.transform(finalx_test.values.astype('U'));

print("training...");
mnb = MultinomialNB();
mnb.fit(X_train_cv, y_train);
#edit:
new_predictions = mnb.predict_log_proba(finalx_cv)
print(new_predictions)

How do I use/give my classifier a new data set, and how do I get it to give me the percentage appearance of each class in that new set?

Edit: vegas700.xlsx has three columns: in order from left to right they are called 'id', 'text', and 'label'. id is just the item number, text is the text, and label is a class, either 0 or 1.

After adding the new lines of code, I get a result of:

[[-8.24928263e+00 -2.61480227e-04]
 [-4.33474053e+00 -1.31919059e-02]
 [-3.81104731e+00 -2.23734239e-02]
 ...
 [-1.62156753e-04 -8.72702816e+00]
 [-3.35454988e+00 -3.55495505e-02]
 [-1.16414198e-01 -2.20824326e+00]]

I have no idea what this means, and no idea if it is correct.

Try mnb.predict(new_dataset). By the way, you do not need to use ; at the end of each line. - moys
@moys What form should the new dataset be in? Should it be a pandas DataFrame, or a raw Excel file? Also, sorry about the ;, it's a habit from Java - Nick
It should be in the same form as X_train that you have used to train the model (even the same columns as the X_train) - moys
@moys I've realized that's what I need to figure out how to do. I'm not sure what form exactly X_train is, but the vegas700.xlsx and vegasunlabeled.xlsx files are in the exact same format. I assume that doing another train_test_split on vegasunlabeled isn't the right way to go about it, but I don't really understand what X_train is. - Nick

1 Answer

The issue is that you are calling predict_log_proba instead of predict. What you are seeing is the natural log of the probability that each sample belongs to class 0 or class 1 (one column per class, in the order given by mnb.classes_), which is helpful if you want to see how "sure" your model is of each label. If you only want the labels themselves, use predict. More info is in the scikit-learn documentation for MultinomialNB.
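
To make the numbers in the question readable, here is a small sketch that exponentiates a few of the rows you posted. It assumes the columns are in the order class 0, class 1, which is what mnb.classes_ should report for labels 0 and 1. In practice mnb.predict_proba gives you the probabilities directly.

```python
import numpy as np

# Three rows copied from the output in the question: one row per sample,
# one column per class (assumed order: class 0, then class 1).
log_proba = np.array([
    [-8.24928263e+00, -2.61480227e-04],
    [-1.62156753e-04, -8.72702816e+00],
    [-1.16414198e-01, -2.20824326e+00],
])

proba = np.exp(log_proba)       # back to ordinary probabilities
labels = proba.argmax(axis=1)   # column index of the most likely class per row

print(proba.round(4))
print(proba.sum(axis=1))        # each row sums to roughly 1
print(labels)
```

So the first sample is almost certainly class 1, the second almost certainly class 0, and the third is class 0 with about 89% probability.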

Edit: Since this is a simple two-class problem with labels 0 and 1, you just need to sum the predicted labels and divide by the number of predictions to get the percentage of samples labeled 1:

# x is the vectorized new text, e.g. cv.transform(df2['text'].values.astype('U'))
preds = mnb.predict(x)
print(100*preds.sum()/len(preds))
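
Here is a rough end-to-end sketch of that idea. The two small DataFrames are made-up stand-ins for vegas700.xlsx and vegasunlabeled.xlsx (I don't have your files), and the vectorizer uses default settings rather than yours. The point is that the new text only goes through transform, never fit_transform, so it is encoded with the vocabulary learned during training.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the labeled and unlabeled Excel files.
labeled = pd.DataFrame({
    "text": ["great show loved it", "terrible service rude staff",
             "loved the great pool", "rude and terrible room"],
    "label": [1, 0, 1, 0],
})
unlabeled = pd.DataFrame({"text": ["great pool", "rude staff", "loved it"]})

cv = CountVectorizer()
mnb = MultinomialNB()
mnb.fit(cv.fit_transform(labeled["text"]), labeled["label"])

# transform (not fit_transform) reuses the vocabulary learned in training
new_cv = cv.transform(unlabeled["text"])
preds = mnb.predict(new_cv)

# Share of each predicted class, in percent
percentages = pd.Series(preds).value_counts(normalize=True) * 100
print(percentages)
```

value_counts(normalize=True) also works unchanged if you later have more than two classes, which the sum-and-divide trick does not.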

One more suggestion for applying the model to new datasets: look into scikit-learn's Pipeline. A pipeline bundles the vectorizer and the classifier into one object, so you can go from raw text to predictions in a single call. Also, you don't need train_test_split for the new data. Just load it, transform it with the already-fitted vectorizer, and predict.
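
A minimal pipeline sketch, again with made-up data in place of your Excel files and a default CountVectorizer instead of your custom settings (you would pass your own arguments there):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the labeled and unlabeled data.
labeled = pd.DataFrame({
    "text": ["fun casino night", "dirty room bad smell",
             "fun buffet night", "bad dirty carpet"],
    "label": [1, 0, 1, 0],
})
unlabeled = pd.DataFrame({"text": ["fun night", "dirty carpet"]})

# The pipeline vectorizes and classifies in one object.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(labeled["text"], labeled["label"])

# Raw text goes straight in; the pipeline applies the fitted vectorizer itself.
preds = model.predict(unlabeled["text"])
pct_label_1 = 100 * preds.mean()   # valid because the labels are 0 and 1
print(preds, pct_label_1)
```

Because the vectorizer lives inside the pipeline, there is no way to accidentally refit it on the new data.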