I have a large dataset (42.9 GB) stored in NumPy's compressed .npz format. When loaded, the data has
n_samples, n_features = 406762, 26421
I need to perform dimensionality reduction on it and am hence using scikit-learn's PCA. Usually, I do:
from sklearn.decomposition import PCA

pca = PCA(n_components=200).fit(x)
x_transformed = pca.transform(x)
Since the data cannot be loaded into memory at once, I am using IncrementalPCA instead, as it supports out-of-core fitting via its partial_fit method:
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=200)
for i in range(407):  # one pass per stored chunk
    partial_x = load("...")  # load the i-th chunk from disk
    ipca.partial_fit(partial_x)
Now that the model is fit on the complete data, how do I perform the transform? transform takes the entire dataset at once, and there is no partial_transform method.
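For context, since transform is just a linear projection, my current thinking is to apply it chunk by chunk and stack the results. A minimal self-contained sketch of that idea, using small synthetic arrays as stand-ins for the real 407 npz chunks:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-ins for the chunks loaded from disk
rng = np.random.RandomState(0)
chunks = [rng.rand(50, 20) for _ in range(4)]

ipca = IncrementalPCA(n_components=5)
for chunk in chunks:
    ipca.partial_fit(chunk)

# transform is a per-sample linear projection, so transforming each
# chunk independently and stacking matches a single full transform
x_transformed = np.vstack([ipca.transform(chunk) for chunk in chunks])
print(x_transformed.shape)  # (200, 5)
```

Is this per-chunk approach the intended way to do it, or is there something built in that I'm missing?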
Edit #1:
Once the reduced-dimensional representation of the data is computed, this is how I verify the reconstruction error:
from sklearn.metrics import mean_squared_error

reconstructed_matrix = pca_model.inverse_transform(reduced_x)
error_curr = mean_squared_error(x, reconstructed_matrix)
How do I calculate this error for the large dataset? Also, is there a way to use partial_fit as part of GridSearchCV or RandomizedSearchCV to find the best n_components?
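For the error question, my current idea is to accumulate the squared error chunk by chunk instead of materializing the full reconstruction in memory. A hedged sketch of that, again with synthetic stand-in chunks (the real data would come from the npz files):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-ins for the chunks loaded from disk
rng = np.random.RandomState(0)
chunks = [rng.rand(50, 20) for _ in range(4)]

ipca = IncrementalPCA(n_components=5)
for chunk in chunks:
    ipca.partial_fit(chunk)

# Accumulate the sum of squared errors and the element count per chunk,
# so the full reconstruction never has to fit in memory
sq_err_sum = 0.0
n_total = 0
for chunk in chunks:
    reconstructed = ipca.inverse_transform(ipca.transform(chunk))
    sq_err_sum += ((chunk - reconstructed) ** 2).sum()
    n_total += chunk.size

mse = sq_err_sum / n_total  # should equal mean_squared_error on the full matrix
```

Would this chunked accumulation give the same value as mean_squared_error on the full matrices?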