
Pardon me if I use the wrong terminology, but what I want is to train on a set of data (using GaussianNB, the Naive Bayes classifier from scikit-learn), save the model/classifier, and then load it whenever I need it to predict a category.

from sklearn.externals import joblib
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer

self.vectorizer = TfidfVectorizer(decode_error='ignore')
self.X_train_tfidf = self.vectorizer.fit_transform(train_data)

# Fit the model to my training data
self.gnb = GaussianNB()
self.clf = self.gnb.fit(self.X_train_tfidf.toarray(), category)

# Save the classifier to file
joblib.dump(self.clf, 'trained/NB_Model.pkl')

# Save the vocabulary to file
joblib.dump(self.vectorizer.vocabulary_, 'trained/vectorizer_vocab.pkl')


#Next time, I read the saved classifier
self.clf = joblib.load('trained/NB_Model.pkl')

# Read the saved vocabulary
self.vocab = joblib.load('trained/vectorizer_vocab.pkl')

# Initializer the vectorizer
self.vectorizer = TfidfVectorizer(vocabulary=self.vocab, decode_error='ignore')

# Try to predict a category for new data
X_new_tfidf = self.vectorizer.transform(new_data)
print self.clf.predict(X_new_tfidf.toarray())

# After running the predict command above, I get the error
'idf vector is not fitted'

Can anyone tell me what I'm missing?

Note: The saving of the model, the loading of the saved model, and the prediction of a new category are all separate methods of a class. I have collapsed them into a single snippet here for easier reading.

Thanks


1 Answer


You need to pickle self.vectorizer itself and load it again. Currently you are only saving the vocabulary learnt by the vectorizer.

Change the following line in your program:

joblib.dump(self.vectorizer.vocabulary_, 'trained/vectorizer_vocab.pkl')

to:

joblib.dump(self.vectorizer, 'trained/vectorizer.pkl')

And the following line:

self.vocab = joblib.load('trained/vectorizer_vocab.pkl')

to:

self.vectorizer = joblib.load('trained/vectorizer.pkl')

Delete this line:

self.vectorizer = TfidfVectorizer(vocabulary=self.vocab, decode_error='ignore')
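
Putting those changes together, the save-and-load flow would look roughly like this (a minimal sketch outside your class; train_data, category and new_data are the same variables as in your question, and sklearn.externals.joblib matches your import, though in newer scikit-learn releases joblib is a standalone package):

from sklearn.externals import joblib
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Training: fit the vectorizer and the classifier
vectorizer = TfidfVectorizer(decode_error='ignore')
X_train_tfidf = vectorizer.fit_transform(train_data)
clf = GaussianNB().fit(X_train_tfidf.toarray(), category)

# Save both fitted objects, not just the vocabulary
joblib.dump(clf, 'trained/NB_Model.pkl')
joblib.dump(vectorizer, 'trained/vectorizer.pkl')

# Later: load both and predict
clf = joblib.load('trained/NB_Model.pkl')
vectorizer = joblib.load('trained/vectorizer.pkl')

X_new_tfidf = vectorizer.transform(new_data)
print(clf.predict(X_new_tfidf.toarray()))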

Problem explanation:

You are right in thinking that you could just save the learnt vocabulary and reuse it. But scikit-learn's TfidfVectorizer also has an idf_ attribute, which holds the IDF weights learnt for that vocabulary, so you would need to save that as well. Even if you save both and load them into a new TfidfVectorizer instance, you will still get the "not fitted" error, because that is simply how most scikit-learn transformers and estimators are defined: fitted state is checked on internal attributes set during fit(). So, without doing anything "hacky", saving the whole vectorizer is your best bet. If you still want to go down the save-the-vocabulary path, take a look here for how to do it properly:

The page above saves the vocabulary as JSON and idf_ as a plain array. You can use pickle there instead, but it will give you an idea of how TfidfVectorizer works internally.
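
For completeness, here is a rough sketch of that manual route. Note that it touches the private _tfidf._idf_diag attribute of TfidfVectorizer, so it depends on scikit-learn internals and may break between versions; saving the whole vectorizer, as above, avoids this:

import scipy.sparse as sp
from sklearn.externals import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# After fitting: persist the vocabulary and the learnt IDF weights
joblib.dump(vectorizer.vocabulary_, 'trained/vectorizer_vocab.pkl')
joblib.dump(vectorizer.idf_, 'trained/vectorizer_idf.pkl')

# Later: rebuild a vectorizer from the saved pieces
vocab = joblib.load('trained/vectorizer_vocab.pkl')
idf = joblib.load('trained/vectorizer_idf.pkl')

new_vectorizer = TfidfVectorizer(vocabulary=vocab, decode_error='ignore')
# Hacky: write the IDF diagonal back into the internal TfidfTransformer
new_vectorizer._tfidf._idf_diag = sp.spdiags(idf, diags=0, m=len(idf), n=len(idf))

X_new_tfidf = new_vectorizer.transform(new_data)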

Hope it helps.