I need to build a matrix of TF-IDF features from text stored in a column of a huge dataframe, loaded from a CSV file that does not fit in memory. I am trying to iterate over the dataframe in chunks, but this gives me a generator object, which does not seem to be an input type that TfidfVectorizer accepts. I guess I am doing something wrong in the generator function ChunkIterator shown below.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Works only for a small dataset that fits in memory
csvfilename = 'data_elements.csv'
df = pd.read_csv(csvfilename)
vectorizer = TfidfVectorizer()
corpus = df['text_column'].values
vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

# Trying to use a generator to iterate over a huge dataframe
def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        yield chunk['text_column'].values

corpus = ChunkIterator(csvfilename)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
Can anybody please advise how to modify the ChunkIterator function above, or suggest another approach that still uses a dataframe? I would like to avoid creating a separate text file for each row of the dataframe. Below is some dummy CSV data for recreating the scenario.
id,text_column,tags
001, This is the first document .,"['sports','entertainment']"
002, This document is the second document .,"['politics', 'asia']"
003, And this is the third one .,"['europe','nato']"
004, Is this the first document ?,"['sports', 'soccer']"