
I have the following scenario: I need to determine, from a list of 500,000 strings, which ones refer to businesses and which to persons.

A reductive example of the problem:

  1. Stackoverflow LLC -> Business
  2. John Doe -> Person
  3. John Doe Inc. -> Business

Luckily for me, I have 500,000 names labeled, so this becomes a supervised problem. Yay.
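For reference, the data lives in a pandas DataFrame roughly like the toy version below (a minimal sketch; only the column names CUST_NM_CLEAN and LABEL match my real data, the sample rows and the 'Org'/'Person' label values are made up for illustration):

import pandas as pd

# Toy stand-in for the real 500,000-row labeled dataset;
# only the column names match the actual data.
df = pd.DataFrame({
    "CUST_NM_CLEAN": ["Stackoverflow LLC", "John Doe", "John Doe Inc."],
    "LABEL": ["Org", "Person", "Org"],
})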

The first model I ran was a simple multinomial Naive Bayes; the code is below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["CUST_NM_CLEAN"], 
                                                    df["LABEL"],test_size=0.20, 
                                                    random_state=1)

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. 
testing_data = count_vector.transform(X_test)

# In this case we use the multinomial variant; there are other Naive Bayes flavors too
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
#MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

predictions = naive_bayes.predict(testing_data)


from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, pos_label='Org')))
print('Recall score: {}'.format(recall_score(y_test, predictions, pos_label='Org')))
print('F1 score: {}'.format(f1_score(y_test, predictions, pos_label='Org')))
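Beyond the four summary numbers, a confusion matrix and per-class report can help show where the model errs (a quick sketch reusing y_test and predictions from above):

from sklearn.metrics import classification_report, confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, predictions))
# Per-class precision, recall, and F1 in one table
print(classification_report(y_test, predictions))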

Results I'm getting:

  • Accuracy score: 0.9524850665857665
  • Precision score: 0.9828196680932295
  • Recall score: 0.8890405236039549
  • F1 score: 0.9335809546092653

Not too shabby for a first go. However, when I export the results to a file and compare the predictions to the labels, I get a much lower accuracy, somewhere around 60%. This is very far from the 95% score that sklearn is reporting...

Any ideas?

Here is how I'm writing the file; this might be the cause:

mnb_results = np.array(list(zip(df["CUST_NM_CLEAN"].values.tolist(),df["LABEL"],predictions)))
mnb_results = pd.DataFrame(mnb_results, columns=['name', 'label', 'predicted'])
mnb_results.to_csv('mnb_vectorized.csv', index = False)

P.S. I'm a newbie here, so I apologize if there is an obvious solution I'm missing.

One thing to notice is the export to CSV. If you are validating using the CSV, then I think you will need to export X_test, y_test, and predictions. Also, cross-validation can be done to check if it is performing as expected. – coldy
You, sir, are a savior. For any future viewers, I changed the code to: mnb_results = np.array(list(zip(X_test, y_test, predictions))) – mikelowry
I will add this as an answer, you could accept it :) – coldy

1 Answer


One thing to notice is the export to CSV. If you are validating using the CSV, then you need to export X_test, y_test, and predictions (the rows the model was actually scored on), not the full df. Cross-validation can also be done to check that the model performs as expected.

Old:

mnb_results = np.array(list(zip(df["CUST_NM_CLEAN"].values.tolist(),df["LABEL"],predictions)))

Changed:

mnb_results = np.array(list(zip(X_test, y_test, predictions)))
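Putting the corrected export together, something like this should work (a sketch; the column names are just illustrative):

import pandas as pd

# Keep the test rows, their true labels, and the predictions together
mnb_results = pd.DataFrame({
    "name": X_test.values,
    "label": y_test.values,
    "predicted": predictions,
})
mnb_results.to_csv('mnb_vectorized.csv', index=False)

# Accuracy recomputed from the exported rows should now match sklearn's score
print((mnb_results["label"] == mnb_results["predicted"]).mean())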

More details:

# Accuracy computed directly with numpy (other metrics can be checked similarly):
import numpy as np
true = np.asarray([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])
predictions = np.asarray([1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0])
print("Accuracy:{}".format(np.mean(true==predictions)))