
I need to compute a matrix of TF-IDF features from text stored in a column of a huge dataframe, loaded from a CSV file that does not fit in memory. I am trying to iterate over the dataframe in chunks, but that gives me a generator object, which does not seem to be an accepted input type for TfidfVectorizer. I guess I am doing something wrong in the generator function ChunkIterator shown below.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Works only for a small dataset
csvfilename = 'data_elements.csv'
df = pd.read_csv(csvfilename)
vectorizer = TfidfVectorizer()
corpus  = df['text_column'].values

# Trying to use a generator to iterate over a huge dataframe
def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        yield chunk['text_column'].values

corpus  = ChunkIterator(csvfilename)

Can anybody please advise how to modify the ChunkIterator function above, or suggest another approach that uses a dataframe? I would like to avoid creating a separate text file for each row of the dataframe. Below is some dummy CSV data for recreating the scenario.

001, This is the first document .,['sports','entertainment']
002, This document is the second document .,"['politics', 'asia']"
003, And this is the third one .,['europe','nato']
004, Is this the first document ?,"['sports', 'soccer']"

1 Answer


TfidfVectorizer accepts generators just fine, but it requires an iterable of raw documents, i.e. strings. Your generator is an iterable of numpy.ndarray objects, one array per chunk. So try something like:

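def ChunkIterator(filename):
    # Use the filename argument rather than the global csvfilename.
    for chunk in pd.read_csv(filename, chunksize=1):
        # Yield each document (a string), not the whole per-chunk array.
        for document in chunk['text_column'].values:
            yield document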
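
To make the difference concrete, here is a minimal sketch of the same flattening step without pandas. The plain lists are stand-ins I am assuming for the per-chunk arrays, and chunk_iterator is a hypothetical helper name. It also shows that itertools.chain.from_iterable gives the same result as the nested loop:

```python
from itertools import chain


def chunk_iterator(chunks):
    """Yield every document from an iterable of chunks (nested-loop form)."""
    for chunk in chunks:
        for document in chunk:
            yield document


# Stand-in chunks: each inner list plays the role of chunk['text_column'].values.
chunks = [
    ["This is the first document ."],
    ["This document is the second document ."],
]

# Both forms produce a flat sequence of strings, one per document.
nested = list(chunk_iterator(chunks))
flattened = list(chain.from_iterable(chunks))
print(nested == flattened)
```

Either form hands the vectorizer one string at a time, which is what it expects.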

Note, I don't really understand why you are using pandas here. Just use the regular csv module, something like:

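import csv

def doc_generator(filepath, textcol=0, skipheader=True):
    # newline='' lets the csv module handle line endings itself
    with open(filepath, newline='') as f:
        reader = csv.reader(f)
        if skipheader:
            next(reader, None)  # skip the header row
        for row in reader:
            yield row[textcol]  # one document (string) per row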

So, in your case, pass 1 to textcol (and skipheader=False if your file has no header row), for example:

In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: import csv
   ...: def doc_generator(filepath, textcol=0, skipheader=True):
   ...:     with open(filepath) as f:
   ...:         reader = csv.reader(f)
   ...:         if skipheader:
   ...:             next(reader, None)
   ...:         for row in reader:
   ...:             yield row[textcol]

In [3]: vectorizer = TfidfVectorizer()

In [4]: result = vectorizer.fit_transform(doc_generator('testing.csv', textcol=1))

In [5]: result
<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>

In [6]: result.todense()
matrix([[ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524],
        [ 0.        ,  0.6876236 ,  0.        ,  0.28108867,  0.        ,
          0.53864762,  0.28108867,  0.        ,  0.28108867],
        [ 0.51184851,  0.        ,  0.        ,  0.26710379,  0.51184851,
          0.        ,  0.26710379,  0.51184851,  0.26710379],
        [ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524]])
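
If you want to check what the generator feeds the vectorizer before running it on the real file, a quick sketch like the following may help. It uses only the standard library; the header names (id, text_column, tags) and the temporary file are assumptions made for illustration, with rows modeled on the dummy data in the question:

```python
import csv
import os
import tempfile


def doc_generator(filepath, textcol=0, skipheader=True):
    """Yield the text column of each CSV row, one string at a time."""
    with open(filepath, newline="") as f:
        reader = csv.reader(f)
        if skipheader:
            next(reader, None)  # discard the header row, if any
        for row in reader:
            yield row[textcol]


# Hypothetical sample file: header names are assumed, rows follow the question.
sample = (
    "id,text_column,tags\n"
    "001,This is the first document .,\"['sports', 'entertainment']\"\n"
    "002,This document is the second document .,\"['politics', 'asia']\"\n"
)

# Write the sample to a temporary CSV file so the generator has something to read.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    docs = list(doc_generator(path, textcol=1))
finally:
    os.remove(path)  # clean up the temporary file

print(docs)
```

Each element of docs is a plain string, so the same generator call can be passed directly to fit_transform.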