I'm new to Python and scikit-learn, so please bear with me if this is a stupid question. I've followed some tutorials to build a multinomial naive Bayes classifier with sklearn, and I've trained and tested it to a decent accuracy. However, I've reached the end of the tutorials and realized I don't actually know how to feed it new data to classify. Here's my code:
import sklearn as skl;
import pandas as pd;
from sklearn.metrics import accuracy_score, precision_score, recall_score;
from sklearn.model_selection import train_test_split;
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB;
from sklearn.metrics import confusion_matrix;
import matplotlib.pyplot as plt;
import seaborn as sns;
import numpy as np;
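def print_top10(vectorizer, clf):
    # Print the 10 highest-weighted features for each class.
    feature_names = vectorizer.get_feature_names()
    class_labels = clf.classes_
    for i, class_label in enumerate(class_labels):
        # feature_log_prob_[i] holds the per-feature log-probabilities of class i,
        # so each class gets its own ranking (coef_[0] gave the same list for every class).
        top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))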
df = pd.read_excel(r'C:\Users\Nicholas\vegas700.xlsx');
#edit:
df2 = pd.read_excel(r'C:\Users\Nicholas\vegasunlabeled.xlsx');
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], random_state=11, test_size=0.25);
#edit:
finalx_train, finalx_test, finaly_train, finaly_test = train_test_split(df['text'], df['label'], random_state=1, test_size=0.99)
cv = CountVectorizer(strip_accents='ascii', token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b', lowercase=True, stop_words='english');
X_train_cv = cv.fit_transform(X_train.values.astype('U'));
X_test_cv = cv.transform(X_test.values.astype('U'));
#edit:
finalx_cv = cv.transform(finalx_test.values.astype('U'));
print("training...");
mnb = MultinomialNB();
mnb.fit(X_train_cv, y_train);
#edit:
new_predictions = mnb.predict_log_proba(finalx_cv)
print(new_predictions)
How do I give my classifier a new data set, and how do I get it to tell me what percentage of that new set falls into each class?
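For what it's worth, here is a minimal sketch of one way this could work. It assumes the new file only needs a 'text' column. The two small DataFrames are made-up stand-ins for vegas700.xlsx and vegasunlabeled.xlsx, and the names labeled, unlabeled, X_new_cv, new_labels and class_share are hypothetical. The main point is that new text goes through the already fitted vectorizer with transform (not fit_transform), and no second train_test_split is involved.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up rows standing in for vegas700.xlsx (assumption: columns 'text' and 'label').
labeled = pd.DataFrame({
    "text": ["great show and great food", "loved the pool",
             "terrible dirty room", "rude staff and dirty lobby"],
    "label": [1, 1, 0, 0],
})
# Made-up rows standing in for vegasunlabeled.xlsx (assumption: a 'text' column, no labels).
unlabeled = pd.DataFrame({
    "text": ["great pool", "dirty room and rude staff", "loved the food"],
})

# Fit the vectorizer and the classifier on the labeled data only.
cv = CountVectorizer(lowercase=True, stop_words="english")
X_train_cv = cv.fit_transform(labeled["text"].values.astype("U"))
mnb = MultinomialNB()
mnb.fit(X_train_cv, labeled["label"])

# Reuse the fitted vectorizer: transform only, so the columns match the training matrix.
X_new_cv = cv.transform(unlabeled["text"].values.astype("U"))
new_labels = mnb.predict(X_new_cv)

# Share of each predicted class in the new data, as percentages.
class_share = pd.Series(new_labels).value_counts(normalize=True).sort_index() * 100
print(class_share)
```

With the real files, the two DataFrames would presumably come from pd.read_excel as in the code above, and the fit step would use whatever training split was already used.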
Edit:
vegas700.xlsx has three columns: in order from left to right they are called 'id', 'text', and 'label'. id is just the item number, text is the text, and label is a class, either 0 or 1.
After adding the new lines of code, I get a result of:
[[-8.24928263e+00 -2.61480227e-04]
[-4.33474053e+00 -1.31919059e-02]
[-3.81104731e+00 -2.23734239e-02]
...
[-1.62156753e-04 -8.72702816e+00]
[-3.35454988e+00 -3.55495505e-02]
[-1.16414198e-01 -2.20824326e+00]]
I have no idea what this means, and no idea if it is correct.
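As a hedged reading of that output: predict_log_proba returns one row per input text and one column per class, in the order of mnb.classes_ (which is sorted, so [0, 1] for labels 0 and 1). Each entry is the natural log of the predicted class probability, so exponentiating a row should give two numbers that sum to about 1. The sketch below just replays the six rows printed above (the rows hidden behind "..." are not available); the names log_proba, proba and predicted are hypothetical.

```python
import numpy as np

# The six rows shown in the printed output; the elided rows are unknown.
log_proba = np.array([
    [-8.24928263e+00, -2.61480227e-04],
    [-4.33474053e+00, -1.31919059e-02],
    [-3.81104731e+00, -2.23734239e-02],
    [-1.62156753e-04, -8.72702816e+00],
    [-3.35454988e+00, -3.55495505e-02],
    [-1.16414198e-01, -2.20824326e+00],
])

# Undo the log to get ordinary probabilities; each row sums to roughly 1.
proba = np.exp(log_proba)

# Index of the more likely column per row; this equals the label
# only under the assumption that mnb.classes_ is [0, 1].
predicted = proba.argmax(axis=1)

print(proba.round(4))
print(predicted)
```

mnb.predict_proba returns the non-log probabilities directly, and mnb.predict returns the most likely label per row, which is usually the easier thing to count.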
mnb.predict(new_dataset). By the way, you do not need to use ; at the end of each line. - moys
;, it's a habit from java - Nick
X_train that you have used to train the model (even the same columns as the X_train) - moys
vegas700.xlsx and vegasunlabeled.xlsx files are in the exact same format. I assume that doing another train_test_split on vegasunlabeled isn't the right way to go about it, but I don't really understand what X_train is. - Nick