
I am trying to train an SVM classifier using scikit-learn. At training time I want to reduce the feature vector dimension, and I have used PCA for this:

from sklearn.decomposition import PCA

pp = PCA(n_components=400).fit(features)
features = pp.transform(features)

PCA requires an m x n dataset to determine the variance, but at inference time I have only a single image and its corresponding 1-D feature vector. I am wondering how to reduce the feature vector at inference time so that it matches the training dimension.


3 Answers


Like all preprocessing modules in scikit-learn, PCA includes a transform method that does exactly that, i.e. it transforms new samples according to an already fitted PCA; from the docs:

transform(self, X)

Apply dimensionality reduction to X.

X is projected on the first principal components previously extracted from a training set.

Here is a short demo with dummy data, adapting the example from the documentation:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)

X_new = np.array([[1, -1]]) # new data; notice the double brackets (a 2D array)

X_new_pca = pca.transform(X_new)
X_new_pca
# array([[-0.2935787 ,  1.38340578]])

If you want to avoid the double brackets for a single new sample, you should make it into a numpy array and reshape it as follows:

X_new = np.array([1, -1])
X_new_pca = pca.transform(X_new.reshape(1, -1))
X_new_pca
# array([[-0.2935787 ,  1.38340578]]) # same result
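
Also, since training and inference typically happen in separate runs, you may want to persist the fitted PCA object and reload it at inference time alongside your classifier. A minimal sketch using joblib (the file name is just a placeholder):

import joblib

# After fitting, save the PCA object to disk
joblib.dump(pca, 'pca.joblib')

# At inference time, load it back and transform the single sample
pca = joblib.load('pca.joblib')
X_new_pca = pca.transform(np.array([1, -1]).reshape(1, -1))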

After "training" the PCA (or mathematically speaking, after the dimensionality reduction matrix is computed), you can use the transform function on any matrix or vector with suitable sizes, regardless of the original data.

from sklearn.decomposition import PCA
import numpy as np

m = 100
n = 200

features = np.random.randn(m, n)
print(features.shape)
>> (100, 200)

# Learn the PCA
pp = PCA(n_components=50).fit(features)
low_dim_features = pp.transform(features)
print(low_dim_features.shape)
>> (100, 50)

# Perform dimensionality reduction on a new sample
new_sample = np.random.randn(1, n)
low_dim_sample = pp.transform(new_sample)
print(low_dim_sample.shape)
>> (1, 50)
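
Since the original goal is an SVM classifier, you can also chain the reduction and the classifier with a scikit-learn Pipeline, so that predict applies the same fitted PCA to each new sample automatically. A minimal sketch continuing the example above (the labels are random dummies, just for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Dummy binary labels, one per training sample
y = np.random.randint(0, 2, size=m)

# PCA + SVM in one estimator: fit learns the projection and the classifier
clf = make_pipeline(PCA(n_components=50), SVC())
clf.fit(features, y)

# predict transforms the single sample with the fitted PCA, then classifies
print(clf.predict(new_sample).shape)
>> (1,)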

PCA works perfectly fine for this case; it doesn't matter that you have only a single image at test time.

Assume your training set is 100 samples by 1000 features. Fitting PCA on the training set gives you 1000-dimensional eigenvectors, because the covariance matrix is 1000 x 1000. Through eigendecomposition you then select only a fraction of those eigenvectors; say you keep 25, so your projection matrix is 1000 x 25.

At test time, with a single example of 1 x 1000 features, you only need to project the features onto that 1000 x 25 eigenspace, and you get 1 x 25 reduced features. Your training set then has 100 x 25 features and your single test sample 1 x 25 features, and you can train and test any machine learning classifier with that.
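
For illustration, here is a minimal NumPy sketch of that projection, doing the eigendecomposition by hand; the shapes match the numbers above, and all variable names are made up for the example:

import numpy as np

m, n, k = 100, 1000, 25

X_train = np.random.randn(m, n)   # 100 x 1000 training set
x_test = np.random.randn(1, n)    # single 1 x 1000 test sample

# Center both with the *training* mean
mean = X_train.mean(axis=0)
Xc = X_train - mean

# 1000 x 1000 covariance matrix and its eigendecomposition
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Keep the 25 eigenvectors with the largest eigenvalues -> 1000 x 25
W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]

# Project onto the eigenspace
X_train_reduced = Xc @ W               # 100 x 25
x_test_reduced = (x_test - mean) @ W   # 1 x 25
print(X_train_reduced.shape, x_test_reduced.shape)
>> (100, 25) (1, 25)

(In practice you would just use the fitted PCA's transform, as in the other answers; scikit-learn computes the projection via SVD rather than an explicit covariance eigendecomposition, but the idea is the same.)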