8
votes

I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, each with 150k features. I've tried to implement the code from the following link: Implementing Bag-of-Words Naive-Bayes classifier in NLTK

The problem is (as I understand it) that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict as labels):

Traceback (most recent call last):
  File "skitest.py", line 96, in <module>
    classif.train(add_label(matr, labels))
  File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
    for f in fs.iterkeys():
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
    return _cs_matrix.__getattr__(self, attr)
  File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
    raise AttributeError, attr + " not found"
AttributeError: iterkeys not found

My question is: is there a way to either avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use efficiently in this case instead of dok_matrix? Or am I missing something obvious?

Thanks for anyone's time. :)

EDIT, 6th Sep:

Found the iterkeys, so at least the code runs. It's still too slow: it has taken several hours on a dataset of size 32k and still hasn't finished. Here's what I have at the moment:

from collections import OrderedDict

import numpy as np
from scipy.sparse import dok_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk.classify.scikitlearn import SklearnClassifier

matr = dok_matrix((6000000, 150000), dtype=np.float32)
labels = OrderedDict()

#collect the data into the matrix (lentweets and foldsize are set here)

pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

#pair each row (converted back to a dok_matrix) with its label
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets - foldsize)]

classif.train(add_label(matr[:(lentweets - foldsize), 0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets - foldsize)]
data = np.array(classif.batch_classify(readrow))

The problem might be that extracting each row doesn't exploit the sparseness of the vector but instead walks through all 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
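
To illustrate what I mean by exploiting the sparseness, here is a rough, untested sketch of the direction I'm wondering about: converting the matrix to CSR once, so that extracting a row only touches its stored non-zeros (assuming matr, labels, lentweets and foldsize as above):

#untested sketch: convert once to CSR so per-row extraction is cheap
csr = matr.tocsr()

train_rows = [(csr.getrow(x).todok(), labels[x])
              for x in xrange(lentweets - foldsize)]
classif.train(train_rows)

test_rows = [csr.getrow(x + foldsize).todok() for x in xrange(lentweets - foldsize)]
data = np.array(classif.batch_classify(test_rows))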

Perhaps you can encode your features more efficiently, or reduce their size? – piokuc
True, but whatever the number of features, I'm afraid I'll still need to manage the size of the matrix. The dataset consists of tweets' words. – user1638859
Found the iterkeys at least; now the problem is that the code is too slow. – user1638859
Do you need to do it in Python? Have a look at MALLET: mallet.cs.umass.edu, it's pretty fast. – piokuc
No need for Python per se, but we've got people here who are familiar with it. Thanks, I'll check that out. Still, I suppose it would be nice to get a definitive solution for large datasets, so that anyone googling this problem will have an answer here. – user1638859

1 Answer

7
votes

Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for datasets this large. (*)

If you have the documents in text files, you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)

You now have a training set X in the CSR sparse matrix format, which you can feed to a Naive Bayes classifier if you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
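
To classify new documents, transform them with the same, already-fitted vectorizer and call predict. A short sketch, where new_filenames stands in for whatever held-out files you have:

# transform (not fit_transform) reuses the vocabulary learned on the training set
X_new = vect.transform(new_filenames)
predicted = nb.predict(X_new)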

If it turns out this doesn't work because the set of documents is too large (unlikely, since TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
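
As a rough sketch of what the out-of-core approach looks like (not that example verbatim; read_batches is a stand-in for however you stream your tweets and labels from disk):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# non_negative=True keeps the hashed features usable by MultinomialNB
vect = HashingVectorizer(non_negative=True)
nb = MultinomialNB()
all_classes = [0, 1]  # every label the classifier will ever see

for texts, y in read_batches():   # your own generator of (documents, labels)
    X = vect.transform(texts)     # HashingVectorizer is stateless, no fitting needed
    nb.partial_fit(X, y, classes=all_classes)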

(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.