3
votes

Problem: I get an OutOfMemory error when applying PCA to 8 million features.

Here is my code snippet:

from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=10000)
pca_tfidf_sklearn = sklearn_pca.fit(traindata_tfidf.toarray())

I want to apply PCA or another dimensionality reduction technique to features extracted from text using tf-idf. I currently have around 8 million such features and want to reduce them. To classify the documents I am using MultinomialNB.

I am stuck because of the OutOfMemory error.

2

2 Answers

2
votes

I have had a similar problem. Using a Restricted Boltzmann Machine (RBM) instead of PCA fixed it. Mathematically, this is because PCA only looks at the eigenvalues and eigenvectors of your data's covariance matrix, whereas an RBM works as a neural network that considers all multiplicative combinations of the features in your data. The RBM therefore has a much larger set to draw on when deciding which features are more important. It then reduces the features to a much smaller set that retains more important features than PCA can. However, be sure to feature-scale and normalize the data before applying an RBM to it.
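As a rough illustration of the RBM route, here is a minimal sketch using scikit-learn's BernoulliRBM. The toy matrix, the choice of MaxAbsScaler for scaling, and all parameter values are assumptions for the example, not settings from my actual setup.

```python
import numpy as np
from scipy import sparse
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MaxAbsScaler

# Toy stand-in for a tf-idf matrix (assumption: 200 documents, 500 features).
rng = np.random.RandomState(0)
X = sparse.random(200, 500, density=0.02, format="csr", random_state=rng)

# Scale each feature by its maximum absolute value; this keeps the matrix
# sparse and puts the non-negative values into the [0, 1] range.
X_scaled = MaxAbsScaler().fit_transform(X)

# Learn 20 hidden units; the hidden activations are the reduced features.
rbm = BernoulliRBM(n_components=20, learning_rate=0.05, n_iter=5, random_state=0)
X_reduced = rbm.fit_transform(X_scaled)

print(X_reduced.shape)
```

Note that the RBM's weight matrix is dense with shape (n_components, n_features), so memory still grows with the number of input features and n_components should be chosen with that in mind.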

1
votes

I suppose traindata_tfidf is actually in sparse form. Try passing it in one of the scipy sparse formats instead of converting it to a dense array with toarray(). Also take a look at SparsePCA, and if that doesn't help, try MiniBatchSparsePCA.
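One way to keep the data sparse end to end is sketched below. Note that it swaps in TruncatedSVD, which is a different estimator from the SparsePCA mentioned above; it is used here because it accepts scipy sparse input directly. The toy matrix and n_components=50 are assumptions for illustration only.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for traindata_tfidf (assumption: 300 documents, 2000 features).
X = sparse.random(300, 2000, density=0.01, format="csr", random_state=0)

# TruncatedSVD works on the sparse matrix as-is, so no toarray() call is needed.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

The reduced matrix is dense and can contain negative values, which MultinomialNB rejects, so a different classifier or a rescaling step would probably be needed afterwards. Also keep in mind that the fitted components have shape (n_components, n_features), so asking for 10000 components over 8 million features would on its own need roughly 640 GB in float64.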