It depends on the vectorizer you are using.
CountVectorizer counts occurrences of the words in the documents. For each document it outputs a vector of length n_words with the number of times each word appears in that document, where n_words is the total number of distinct words across the documents (i.e. the size of the vocabulary). It also fits a vocabulary so that you can introspect the model (see which words matter, etc.). You can have a look at it using vectorizer.get_feature_names().
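For instance (a toy sketch with made-up documents; note that in recent scikit-learn versions the method is named get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat"]    # made-up documents
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # 2x4 sparse matrix of word counts
print(vectorizer.get_feature_names())    # ['cat', 'dog', 'sat', 'the']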
When you fit it on your first 500 documents, the vocabulary is made only of the words from those 500 documents. Say there are 30k of them: fit_transform outputs a 500x30k sparse matrix. Now you call fit_transform again on the next 500 documents, but they contain only 29k distinct words, so you get a 500x29k matrix...
How do you align these matrices so that all documents have a consistent representation? I can't think of an easy way to do this at the moment.
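To make the problem concrete, here is a small sketch (the documents are made up, so the shapes are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer

batch_1 = ["apples and oranges", "bananas and apples"]
batch_2 = ["cars and trucks", "bikes and cars and buses"]

vect = CountVectorizer()
print(vect.fit_transform(batch_1).shape)  # (2, 4) -> vocabulary: and, apples, bananas, oranges
print(vect.fit_transform(batch_2).shape)  # (2, 5) -> vocabulary: and, bikes, buses, cars, trucks
# The two matrices have different numbers of columns, and column i does not refer
# to the same word in both, so they cannot simply be stacked.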
With TfidfVectorizer you have another issue, namely the inverse document frequency: to compute document frequencies you need to see all the documents at once.
However, a TfidfVectorizer is just a CountVectorizer followed by a TfidfTransformer, so if you manage to get the output of the CountVectorizer right you can then apply a TfidfTransformer to it.
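As a rough sketch of that equivalence (with default parameters, and placeholder documents):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["some text", "some more text"]  # placeholder documents

# Two-step version: counts first, tf-idf weighting afterwards.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step version, equivalent with the default parameters.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)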
With HashingVectorizer things are different: there is no vocabulary here.
In [50]: from sklearn.feature_extraction.text import HashingVectorizer

In [51]: hvect = HashingVectorizer()

In [52]: hvect.fit_transform(X[:1000])  # X holds the raw documents (loaded in the EDIT below)
<1000x1048576 sparse matrix of type '<class 'numpy.float64'>'
with 156733 stored elements in Compressed Sparse Row format>
There are nowhere near 1M+ distinct words in the first 1000 documents, yet the matrix we get has 1M+ columns (2**20 = 1048576, the default n_features).
The HashingVectorizer does not store the words in memory; it hashes each word to a column index. This makes it more memory efficient and guarantees that the matrices it returns always have the same number of columns, so you don't have the alignment problem you get with the CountVectorizer.
This is probably the best solution for the batch processing you described. There are a couple of cons, namely that you cannot get the idf weighting and that you lose the mapping between features and words.
The HashingVectorizer documentation references an example that does out-of-core classification on text data. It may be a bit messy but it does what you'd like to do.
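In spirit, the out-of-core loop looks roughly like this (a minimal sketch, not the actual example; get_batches is a hypothetical generator yielding raw documents with their labels, and all_classes is the array of possible labels):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

hvect = HashingVectorizer()  # stateless: no fitting needed, same number of columns every time
clf = SGDClassifier()        # supports incremental learning via partial_fit

for docs_batch, y_batch in get_batches():  # hypothetical batch generator
    X_batch = hvect.transform(docs_batch)  # always 1048576 columns
    clf.partial_fit(X_batch, y_batch, classes=all_classes)  # all_classes: assumed known up front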
Hope this helps.
EDIT:

If you have too much data, HashingVectorizer is the way to go. If you still want to use CountVectorizer, a possible workaround is to build the vocabulary yourself and pass it to your vectorizer so that you only need to call transform.
Here's an example you can adapt:
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
news = fetch_20newsgroups()
X, y = news.data, news.target
Now the approach that does not work:
# Fitting directly:
vect = CountVectorizer()
vect.fit_transform(X[:1000])
# <1000x27953 sparse matrix of type '<class 'numpy.int64'>'
#  with 156751 stored elements in Compressed Sparse Row format>
Note the number of columns: 27953 here, corresponding to the vocabulary of this batch only, so another batch would give a different, incompatible shape.
Fitting the vocabulary 'manually':
# Reuse the default token pattern from CountVectorizer
token_pattern = re.compile(r'(?u)\b\w\w+\b')

def tokenizer(doc):
    # Note: CountVectorizer also lowercases by default, so lowercase here too
    # if you want the vocabulary to match its preprocessing exactly.
    return token_pattern.findall(doc)

stop_words = set()  # Whatever you want to have as stop words.
vocabulary = set(word for doc in X for word in tokenizer(doc) if word not in stop_words)
vectorizer = CountVectorizer(vocabulary=vocabulary)
X_counts = vectorizer.transform(X[:1000])
# Now X_counts is:
# <1000x155448 sparse matrix of type '<class 'numpy.int64'>'
#  with 149624 stored elements in Compressed Sparse Row format>

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
In your case you'll need to first build the entire X_counts matrix (for all documents, transforming them batch by batch with the fixed vocabulary) before fitting and applying the tf-idf transform.
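For instance, something along these lines (a sketch; batch_size and the slicing are only illustrative):

from scipy.sparse import vstack

batch_size = 1000  # illustrative batch size
count_batches = [vectorizer.transform(X[i:i + batch_size])
                 for i in range(0, len(X), batch_size)]
X_counts = vstack(count_batches)  # same columns for every batch thanks to the fixed vocabulary
X_tfidf = TfidfTransformer().fit_transform(X_counts)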