0
votes

I would be grateful if anyone can help with cv.fit_transform(corpus).toarray() for handling a corpus of approximately 732,066 tweets (each under 140 characters). The text has been cleaned to reduce the number of features and the dimensionality, but I keep getting the error below.

Here's how I started

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


# Importing the dataset
cols = ["text","geocoordinates0","geocoordinates1","grid"]
dataset = pd.read_csv('tweets.tsv', delimiter = '\t', usecols=cols, quoting = 3, error_bad_lines=False, low_memory=False)

# Removing non-ASCII characters from a single string
def remove_non_ascii_1(text):
    return ''.join([ch if ord(ch) < 128 else ' ' for ch in text])

# Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()                            # build the stemmer once, not per tweet
stop_words = set(stopwords.words('english'))    # build the stopword set once; set lookup is O(1)
for i in range(0, 732066):
    review = re.sub('[^a-zA-Z]', ' ', str(dataset['text'][i]))
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 3].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated/removed in newer versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(accuracies.mean())
print(accuracies.std())

And here's the error output:

X = cv.fit_transform(corpus).toarray()

Traceback (most recent call last):

  File "", line 1, in
    X = cv.fit_transform(corpus).toarray()

  File "C:\Anaconda3\envs\py35\lib\site-packages\scipy\sparse\compressed.py", line 920, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)

  File "C:\Anaconda3\envs\py35\lib\site-packages\scipy\sparse\coo.py", line 252, in toarray
    B = self._process_toarray_args(order, out)

  File "C:\Anaconda3\envs\py35\lib\site-packages\scipy\sparse\base.py", line 1009, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

Many thanks in advance!

PS: After removing the .toarray() call and using MultinomialNB as advised by @Kumar, I now get the following error:

from sklearn.naive_bayes import MultinomialNB 
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

Traceback (most recent call last):

  File "", line 1, in
    classifier.fit(X_train, y_train)

  File "C:\Anaconda3\envs\py35\lib\site-packages\sklearn\naive_bayes.py", line 566, in fit
    Y = labelbin.fit_transform(y)

  File "C:\Anaconda3\envs\py35\lib\site-packages\sklearn\base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)

  File "C:\Anaconda3\envs\py35\lib\site-packages\sklearn\preprocessing\label.py", line 296, in fit
    self.y_type_ = type_of_target(y)

  File "C:\Anaconda3\envs\py35\lib\site-packages\sklearn\utils\multiclass.py", line 275, in type_of_target
    if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1):

  File "C:\Anaconda3\envs\py35\lib\site-packages\numpy\lib\arraysetops.py", line 198, in unique
    ar.sort()

TypeError: unorderable types: str() > float()

1
This error is because the output of CountVectorizer is a sparse matrix, and when you call toarray() on it, it becomes a dense array taking a whole lot more memory than its sparse counterpart. Why do you want to call toarray()? Can you show the code below it, i.e. where you use X and y? Most estimators support sparse matrices. - Vivek Kumar
Ok, I see that you are using GaussianNB. Unfortunately that doesn't seem to support sparse matrices. Is there any specific reason you are using it? If not, you can use MultinomialNB for the classification task. See this issue for more details. - Vivek Kumar
I can use the MultinomialNB you mentioned, but I still need to transform the corpus into an array when creating the bag of words, before the ML classifier is applied. This is where I'm stuck. - Seun AJAO
No, that's what I am saying. Do X = cv.fit_transform(corpus), and use X in all the code below. There is no need to create an array from it as far as training is concerned. Even during testing you can use the sparse matrix. - Vivek Kumar
@Kumar please for the sake of clarity, can you post your amended version of the code as an answer to the question. Thanks - Seun AJAO
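The memory point raised in the comments can be checked with a back-of-envelope sketch. Only the row count (732,066 tweets) comes from the question; the vocabulary size of 100,000 terms is a guess for illustration, and the real figure depends on the corpus:

```python
# Rough estimate of why toarray() raises MemoryError here.
n_docs = 732_066           # number of tweets, from the question
n_terms = 100_000          # hypothetical vocabulary size (assumption)
bytes_per_cell = 8         # dense NumPy count matrices default to int64

dense_bytes = n_docs * n_terms * bytes_per_cell
print(f"A dense array would need ~{dense_bytes / 1e9:.0f} GB")  # ~586 GB
```

The sparse matrix returned by fit_transform only stores the non-zero counts, which for short tweets is a tiny fraction of those cells, so keeping X sparse avoids the problem entirely.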

1 Answer

1
vote

All I was saying was: remove .toarray() and replace GaussianNB with MultinomialNB.

.... 
....
# Other code
....
....

X = cv.fit_transform(corpus)
y = dataset.iloc[:, 3].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated/removed in newer versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

.... 
....
# Other code
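On the follow-up TypeError ("unorderable types: str() > float()") from the question's PS: that error typically appears when the label array mixes strings with floats, e.g. when missing values in the label column are loaded as float NaN. A minimal sketch of one possible fix, assuming the 'grid' column has missing entries (the column name comes from the question; the NaN diagnosis is an assumption):

```python
import pandas as pd

# Toy frame mimicking the question's data: a missing 'grid' value becomes
# float NaN, so y would mix str and float and np.unique() could not sort it.
df = pd.DataFrame({"text": ["a", "b", "c"], "grid": ["g1", None, "g2"]})

df = df.dropna(subset=["grid"])       # drop rows with no label...
y = df["grid"].astype(str).values     # ...and force a uniform string dtype
print(y)  # ['g1' 'g2']
```

Doing this before train_test_split keeps X and y aligned, provided the corpus is built from the same filtered frame.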