I have a large dataset (42.9 GB) stored in NumPy's compressed .npz format. When loaded, the data has
n_samples, n_features = 406762, 26421
I need to perform dimensionality reduction on it and am hence using scikit-learn's PCA. Usually, I do:
from sklearn.decomposition import PCA

pca = PCA(n_components=200).fit(x)
x_transformed = pca.transform(x)
Since the data cannot be loaded into memory at once, I am using IncrementalPCA instead, as it supports out-of-core fitting via its partial_fit method:
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=200)
for i in range(407):  # one pass per stored chunk
    partial_x = load("...")  # load the i-th chunk from disk
    ipca.partial_fit(partial_x)
Now that the model is fit on the complete data, how do I perform the transform? transform takes the entire dataset at once, and there is no partial_transform method.
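For context, since transform is just a linear projection, my current thinking is to apply it chunk by chunk and stack the results. A minimal self-contained sketch of that idea, using small synthetic arrays as stand-ins for the real 407 npz chunks:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-ins for the chunks loaded from disk
rng = np.random.RandomState(0)
chunks = [rng.rand(50, 20) for _ in range(4)]

ipca = IncrementalPCA(n_components=5)
for chunk in chunks:
    ipca.partial_fit(chunk)

# transform is a per-sample linear projection, so transforming each
# chunk independently and stacking matches a single full transform
x_transformed = np.vstack([ipca.transform(chunk) for chunk in chunks])
print(x_transformed.shape)  # (200, 5)
```

Is this per-chunk approach the intended way to do it, or is there something built in that I'm missing?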
Edit #1:
Once the reduced-dimensional representation of the data is computed, this is how I verify the reconstruction error:
from sklearn.metrics import mean_squared_error

reconstructed_matrix = pca_model.inverse_transform(reduced_x)
error_curr = mean_squared_error(x, reconstructed_matrix)
How do I calculate this error for the large dataset? Also, is there a way to use partial_fit as part of GridSearchCV or RandomizedSearchCV to find the best n_components?
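For the error question, my current idea is to accumulate the squared error chunk by chunk instead of materializing the full reconstruction in memory. A hedged sketch of that, again with synthetic stand-in chunks (the real data would come from the npz files):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-ins for the chunks loaded from disk
rng = np.random.RandomState(0)
chunks = [rng.rand(50, 20) for _ in range(4)]

ipca = IncrementalPCA(n_components=5)
for chunk in chunks:
    ipca.partial_fit(chunk)

# Accumulate the sum of squared errors and the element count per chunk,
# so the full reconstruction never has to fit in memory
sq_err_sum = 0.0
n_total = 0
for chunk in chunks:
    reconstructed = ipca.inverse_transform(ipca.transform(chunk))
    sq_err_sum += ((chunk - reconstructed) ** 2).sum()
    n_total += chunk.size

mse = sq_err_sum / n_total  # should equal mean_squared_error on the full matrix
```

Would this chunked accumulation give the same value as mean_squared_error on the full matrices?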