Using CountVectorizer in Python Mapper Reducer

Question

I am trying to apply a tokenizer inside a Python mapper/reducer job (mrjob). I have the following code, but I keep getting an error. The reducer outputs its values as a list, and I am passing those values to the vectorizer.

from mrjob.job import MRJob
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

class bagOfWords(MRJob):

    def mapper(self, _, line):
        cat, phrase, phraseid, sentiment = line.split(',')
        yield (cat, phraseid, sentiment), phrase

    def reducer(self, keys, values):

        yield keys, list(values)

    #Output: ["Train", "--", "2"] ["A series of escapades demonstrating the adage that    what is good for the goose", "A series", "A", "series"]

    def mapper(self, keys, values):
        vectorizer = CountVectorizer(min_df=0)
        vectorizer.fit(values)
        x = vectorizer.transform(values)
        x = x.toarray()
        yield keys, (x)


if __name__ == '__main__':
    bagOfWords.run()

ValueError: empty vocabulary; perhaps the documents only contain stop words

Thank you for any help you guys can provide.

ogrisel · Accepted Answer · 2014-09-24T09:01:05

The CountVectorizer is stateful: you need to fit one and the same instance on the full dataset to build the vocabulary, so it is not amenable to parallel processing.
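To illustrate why this matters, here is a minimal sketch (the two chunks are made-up sample phrases, not data from the question) in which two CountVectorizer instances are fitted independently, as they would be in separate mapper or reducer calls. Each one learns its own vocabulary, so the same word can land in a different column and the matrices can have different widths:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical chunks, standing in for the values seen by two separate tasks.
chunk_a = ["a series of escapades", "good for the goose"]
chunk_b = ["good for the gander", "a quiet series"]

# Each task fits its own vectorizer, so each learns its own vocabulary.
vec_a = CountVectorizer().fit(chunk_a)
vec_b = CountVectorizer().fit(chunk_b)

# The column index assigned to the same word depends on the fitted vocabulary.
col_a = vec_a.vocabulary_["series"]
col_b = vec_b.vocabulary_["series"]
print("column of 'series' in chunk A:", col_a)
print("column of 'series' in chunk B:", col_b)

# The resulting matrices do not even share the same number of columns.
x_a = vec_a.transform(chunk_a)
x_b = vec_b.transform(chunk_b)
print(x_a.shape, x_b.shape)
```

Rows produced by different tasks therefore cannot simply be stacked into one feature matrix.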

Instead you can use the HashingVectorizer which is stateless (no need to fit, you can call transform directly).
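A minimal sketch of that approach follows. The helper name `vectorize` and the choice of `n_features=2**10` are assumptions made for illustration; pick a feature count that suits your data. Because the column of each token is computed by a hash function rather than looked up in a fitted vocabulary, independently created instances with the same parameters map the same text to the same columns:

```python
from sklearn.feature_extraction.text import HashingVectorizer

def vectorize(phrases, n_features=2**10):
    """Hypothetical helper: hash a list of phrases into a fixed-width sparse matrix."""
    # No fit step: a fresh instance can be created inside every task.
    vectorizer = HashingVectorizer(n_features=n_features)
    return vectorizer.transform(phrases)

# Sample phrases taken from the reducer output shown in the question.
phrases = ["A series of escapades", "A series"]

# Two independent calls, as if made by two different tasks.
x1 = vectorize(phrases)
x2 = vectorize(phrases)

print(x1.shape)  # one row per phrase, n_features columns
print((x1.toarray() == x2.toarray()).all())
```

Note that `transform` expects an iterable of strings (for example the `list(values)` built in the reducer), not a single string. If the result is yielded from an mrjob step, it will probably need to be converted to something serializable first, for example `x.toarray().tolist()`, assuming the job's default JSON output protocol.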
