It depends on the vectorizer you are using.
CountVectorizer counts occurrences of the words in the documents. For each document it outputs a vector of length n_words with the number of times each word appears in that document, where n_words is the total number of distinct words across the documents (i.e. the size of the vocabulary). It also fits a vocabulary so that you can introspect the model (see which words matter, etc.). You can have a look at it using vectorizer.get_feature_names().
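For instance (a toy sketch with made-up documents; note that in recent scikit-learn versions the method is named get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat"]    # made-up documents
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # 2x4 sparse matrix of word counts
print(vectorizer.get_feature_names())    # ['cat', 'dog', 'sat', 'the']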
When you fit it on your first 500 documents, the vocabulary is made only of the words from those 500 documents. Say there are 30k of them: fit_transform outputs a 500x30k sparse matrix. Now you call fit_transform again on the next 500 documents, but they contain only 29k distinct words, so you get a 500x29k matrix...
How do you align these matrices so that all documents have a consistent representation? I can't think of an easy way to do this at the moment.
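To make the problem concrete, here is a small sketch (the documents are made up, so the shapes are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer

batch_1 = ["apples and oranges", "bananas and apples"]
batch_2 = ["cars and trucks", "bikes and cars and buses"]

vect = CountVectorizer()
print(vect.fit_transform(batch_1).shape)  # (2, 4) -> vocabulary: and, apples, bananas, oranges
print(vect.fit_transform(batch_2).shape)  # (2, 5) -> vocabulary: and, bikes, buses, cars, trucks
# The two matrices have different numbers of columns, and column i does not refer
# to the same word in both, so they cannot simply be stacked.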
With TfidfVectorizer you have another issue, namely the inverse document frequency: to compute document frequencies you need to see all the documents at once.
However, a TfidfVectorizer is just a CountVectorizer followed by a TfidfTransformer, so if you manage to get the output of the CountVectorizer right you can then apply a TfidfTransformer to it.
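As a rough sketch of that equivalence (with default parameters, and placeholder documents):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["some text", "some more text"]  # placeholder documents

# Two-step version: counts first, tf-idf weighting afterwards.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step version, equivalent with the default parameters.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)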
With HashingVectorizer things are different: there is no vocabulary here.
In [50]: from sklearn.feature_extraction.text import HashingVectorizer

In [51]: hvect = HashingVectorizer()

In [52]: hvect.fit_transform(X[:1000])  # X holds the raw documents (loaded in the EDIT below)
<1000x1048576 sparse matrix of type '<class 'numpy.float64'>'
with 156733 stored elements in Compressed Sparse Row format>
There are nowhere near 1M+ distinct words in the first 1000 documents, yet the matrix we get has 1M+ columns (2**20 = 1048576, the default n_features).
The HashingVectorizer does not store the words in memory; it hashes each word to a column index. This makes it more memory efficient and guarantees that the matrices it returns always have the same number of columns, so you don't have the alignment problem you get with the CountVectorizer.
This is probably the best solution for the batch processing you described. There are a couple of cons, namely that you cannot get the idf weighting and that you lose the mapping between features and words.
The HashingVectorizer documentation references an example that does out-of-core classification on text data. It may be a bit messy but it does what you'd like to do.
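In spirit, the out-of-core loop looks roughly like this (a minimal sketch, not the actual example; get_batches is a hypothetical generator yielding raw documents with their labels, and all_classes is the array of possible labels):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

hvect = HashingVectorizer()  # stateless: no fitting needed, same number of columns every time
clf = SGDClassifier()        # supports incremental learning via partial_fit

for docs_batch, y_batch in get_batches():  # hypothetical batch generator
    X_batch = hvect.transform(docs_batch)  # always 1048576 columns
    clf.partial_fit(X_batch, y_batch, classes=all_classes)  # all_classes: assumed known up front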
Hope this helps.
EDIT:

If you have too much data, HashingVectorizer is the way to go. If you still want to use CountVectorizer, a possible workaround is to build the vocabulary yourself and pass it to your vectorizer so that you only need to call transform.
Here's an example you can adapt:
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
news = fetch_20newsgroups()
X, y = news.data, news.target
Now the approach that does not work:
# Fitting directly:
vect = CountVectorizer()
vect.fit_transform(X[:1000])
# <1000x27953 sparse matrix of type '<class 'numpy.int64'>'
#  with 156751 stored elements in Compressed Sparse Row format>
Note the number of columns: 27953 here, corresponding to the vocabulary of this batch only, so another batch would give a different, incompatible shape.
Fitting the vocabulary 'manually':
# Reuse the default token pattern from CountVectorizer
token_pattern = re.compile(r'(?u)\b\w\w+\b')

def tokenizer(doc):
    # Note: CountVectorizer also lowercases by default, so lowercase here too
    # if you want the vocabulary to match its preprocessing exactly.
    return token_pattern.findall(doc)

stop_words = set()  # Whatever you want to have as stop words.
vocabulary = set(word for doc in X for word in tokenizer(doc) if word not in stop_words)
vectorizer = CountVectorizer(vocabulary=vocabulary)
X_counts = vectorizer.transform(X[:1000])
# Now X_counts is:
# <1000x155448 sparse matrix of type '<class 'numpy.int64'>'
#  with 149624 stored elements in Compressed Sparse Row format>

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
In your case you'll need to first build the entire X_counts matrix (for all documents, transforming them batch by batch with the fixed vocabulary) before fitting and applying the tf-idf transform.
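For instance, something along these lines (a sketch; batch_size and the slicing are only illustrative):

from scipy.sparse import vstack

batch_size = 1000  # illustrative batch size
count_batches = [vectorizer.transform(X[i:i + batch_size])
                 for i in range(0, len(X), batch_size)]
X_counts = vstack(count_batches)  # same columns for every batch thanks to the fixed vocabulary
X_tfidf = TfidfTransformer().fit_transform(X_counts)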