3
votes

I am currently working on an image recognition project with machine learning.

  • The train set has 1600 images with size 300x300, so 90000 features per image.
  • To speed up training, I apply PCA with n_components = 50
  • The test set has 450 images, and I can evaluate the model on this test set successfully.

Now I want to predict a single image captured by a webcam. My question is: should I apply PCA to that image?

  • If I don't apply PCA, I get ValueError: X.shape[1] = 90000 should be equal to 50, the number of features at training time
  • If I apply PCA, I get ValueError: n_components=50 must be between 0 and min(n_samples, n_features)=1 with svd_solver='full'

I use Python 3 and scikit-learn 0.20.3. This is how I apply PCA:

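from sklearn.decomposition import PCA

# Fit PCA on the training features and keep the reduced representation
pca = PCA(n_components=50)
features_reduced = pca.fit_transform(features)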

2 Answers

3
votes

You need to apply PCA on your test set as well.

You need to consider what PCA does:

PCA constructs a new feature set (containing fewer features than the original feature space), and you then train on this new feature set. You need to construct the same new feature set for the test set for your model to be valid!

It's important to note that each feature in your 'reduced' feature set is a linear combination of the original features. For a given number of new features (n_components), PCA picks the combinations that preserve as much of the variance of the original space as possible in the new space.

In practice, to perform the relevant transformation on your test set, you need to do:
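To make the "linear combination" point concrete, here is a minimal sketch on random stand-in data (the array X_train and its shape are assumptions for illustration, not your images). With the default whiten=False, the output of transform should match centering the data with pca.mean_ and multiplying by the rows of pca.components_:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 100 samples with 20 original features each
rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 20))

pca = PCA(n_components=5, svd_solver="full")
pca.fit(X_train)

# Each reduced feature is a weighted sum of the centered original features;
# the weights are the rows of pca.components_ (shape: 5 x 20)
manual = (X_train - pca.mean_) @ pca.components_.T
projected = pca.transform(X_train)

same = np.allclose(manual, projected)
print(same)
```

The weights are learned once, from the training data, which is why the same fitted object has to be reused later.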

# X_test - your untransformed test set

X_test_reduced = pca.transform(X_test)

where pca is the instance of PCA() fitted on your training set. Essentially, you are constructing a transformation to a lower-dimensional space, and you want this transformation to be the same for the training and test sets! If you fit pca independently on the training set and on the test set, you are (almost certainly) embedding the data into different low-dimensional representations and end up with different feature sets.
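For the single webcam image from the question, a rough sketch could look like the following. It uses smaller random arrays as stand-ins (30x30 images instead of 300x300), and the names webcam_image and single are hypothetical. The one extra step is reshaping the image into a 2D array with one row, since transform expects input of shape (n_samples, n_features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Stand-in training set: 200 flattened 30x30 images (900 features each)
X_train = rng.rand(200, 900)

# Fit PCA once, on the training data only
pca = PCA(n_components=50)
X_train_reduced = pca.fit_transform(X_train)

# Stand-in for one captured frame, same size as the training images
webcam_image = rng.rand(30, 30)

# Flatten to a single row: shape (1, 900)
single = webcam_image.reshape(1, -1)

# Reuse the fitted PCA; do not call fit or fit_transform here
single_reduced = pca.transform(single)
print(single_reduced.shape)
```

The reduced row can then be passed to the classifier's predict method, because it has the same 50 features the model saw during training.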

1
votes

Yes, you need to apply PCA, following the principle of doing the same things to data during training and testing.

However, the key point is that you must not "retrain" (re-fit) the PCA. Use the transform method of the PCA object you already fitted:

pca.transform(X_test)  # X_test is a 2D array of test images, shaped like your training features

The idea is that fit_transform is a two-step process: it first fits the PCA to the data, and then transforms that data with the fitted PCA. At prediction time you only want the second step.
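A small sketch of that two-step view, again on random stand-in data (the variable names are made up for illustration). It checks that fit_transform gives the same result as fit followed by transform, and that re-fitting on a single sample raises the ValueError from the question, because n_components cannot exceed the number of samples:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.rand(60, 40)  # hypothetical data: 60 samples, 40 features

# One call: fit and transform together
pca_a = PCA(n_components=10, svd_solver="full")
one_step = pca_a.fit_transform(X_train)

# Two calls: fit first, then transform
pca_b = PCA(n_components=10, svd_solver="full")
pca_b.fit(X_train)
two_step = pca_b.transform(X_train)

equivalent = np.allclose(one_step, two_step)

# Re-fitting on one sample fails: 10 components cannot be
# estimated from min(n_samples, n_features) = 1
single = X_train[:1]
try:
    PCA(n_components=10).fit_transform(single)
    refit_failed = False
except ValueError:
    refit_failed = True

print(equivalent, refit_failed)
```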