Sklearn: How to apply dimensionality reduction on huge data set?

Question

Problem: An OutOfMemory error occurs when applying PCA to 8 million features.

Here is my code snippet:

from sklearn.decomposition import PCA as sklearnPCA
sklearn_pca = sklearnPCA(n_components=10000)
pca_tfidf_sklearn = sklearn_pca.fit(traindata_tfidf.toarray())

I want to apply PCA or another dimensionality reduction technique to features extracted from text (using tf-idf). I currently have around 8 million such features and want to reduce them. To classify the documents I am using MultinomialNB.

I am stuck because of the OutOfMemory error.

London Holmes · Accepted Answer · 2016-01-15T02:37:33

I had a similar problem, and using a Restricted Boltzmann Machine (RBM) instead of PCA fixed it. Mathematically, this is because PCA only looks at the eigenvalues and eigenvectors of the covariance matrix of your features, whereas an RBM works as a neural network that considers all multiplicative combinations of the features in your data. The RBM therefore has a much larger set to consider when deciding which features are more important, and it can reduce the features to a much smaller set of more important ones than PCA can. However, be sure to feature-scale and normalize the data before applying an RBM.
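A minimal sketch of the scale-then-RBM idea, assuming scikit-learn's BernoulliRBM and MaxAbsScaler are acceptable choices (the answer does not say which implementation was used). The small random matrix below is a hypothetical stand-in for traindata_tfidf, and n_components=50 is an arbitrary illustrative value. Note that the input stays sparse throughout, so there is no .toarray() call:

```python
import numpy as np
from scipy import sparse
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical stand-in for traindata_tfidf: a sparse, non-negative matrix.
rng = np.random.RandomState(0)
traindata_tfidf = sparse.random(200, 1000, density=0.01, format="csr", random_state=rng)

# MaxAbsScaler divides each feature by its maximum absolute value, which maps
# non-negative tf-idf values into [0, 1] without densifying the matrix.
# BernoulliRBM then learns n_components hidden units from the scaled input.
reducer = make_pipeline(
    MaxAbsScaler(),
    BernoulliRBM(n_components=50, learning_rate=0.05, n_iter=5, random_state=0),
)

# fit_transform returns a dense array of hidden-unit activation probabilities.
reduced = reducer.fit_transform(traindata_tfidf)
print(reduced.shape)  # (200, 50)
```

The transformed values are probabilities between 0 and 1, so they remain non-negative as MultinomialNB requires. One caveat: the RBM's weight matrix is dense with shape (n_components, n_features), so with around 8 million features n_components would probably need to be far smaller than the 10000 used in the question.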
