1
votes

I have a feature matrix of size [4096 x 180], where 180 is the number of samples and 4096 is the length of each sample's feature vector.

I want to reduce the dimensionality of the data using PCA.

I tried using MATLAB's built-in pca function, [V U]=pca(X), and reconstructed the data with X_rec= U(:, 1:n)*V(:, 1:n)', where n is the dimension I chose. This returns a 4096 x 180 matrix.

Now I have 3 questions:

  1. How to obtain the reduced dimension?
  2. When I set n to 200, I got an error about the matrix dimensions being exceeded, which led me to assume that the reduced dimension cannot be larger than the sample size. Is this true?
  3. How to find the right number of reduced dimensions?

I have to use the reduced dimension feature set for further classification.

If anyone can provide a detailed step-by-step explanation of the PCA code for this, I would be grateful. I have looked in many places, but I am still confused.


1 Answer

3
votes

You may want to refer to the MATLAB example that analyzes the cities data.

Here is some simplified code:

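load cities;  % loads ratings, with one row per city and one column per rating category
% pca returns [coeff, score, latent, tsquared, explained]; keep only the scores and the explained variance
[~, pca_scores, ~, ~, var_explained] = pca(ratings);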

Here, pca_scores holds the principal component scores (the data expressed in the principal component basis), and var_explained holds the percentage of the total variance explained by each component. You do not need to do any explicit multiplication after running pca; MATLAB gives you the scores directly.
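
To make the outputs concrete, here is a small sketch that inspects their shapes on the same cities data. It assumes the Statistics and Machine Learning Toolbox is installed; the coeff variable is only requested here for illustration.

```matlab
load cities;   % provides ratings: one row per city, one column per rating category

% coeff: principal directions (one column per component)
% pca_scores: the rows of ratings expressed in those directions
% var_explained: percentage of total variance per component
[coeff, pca_scores, ~, ~, var_explained] = pca(ratings);

disp(size(ratings));        % observations x variables
disp(size(pca_scores));     % same shape here, since no components were dropped
disp(size(coeff));          % variables x components
disp(sum(var_explained));   % the percentages add up to (approximately) 100
```

Keeping only the first few columns of pca_scores is what actually reduces the dimension.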

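In your case, X is a 4096-by-180 matrix with 4096 features and 180 samples. MATLAB's pca expects one observation per row and one variable per column, so pass the transpose, X.' (180-by-4096). Your goal is to reduce the dimensionality to p features. With 180 samples, pca can return at most 179 components, so p must be less than 180. In MATLAB, you can simply run the following:

p = 100;
[~, pca_scores, ~, ~, var_explained] = pca(X.', 'NumComponents', p);

pca_scores will be a 180-by-p matrix (one row per sample), and var_explained will hold the percentage of the total variance explained by each principal component.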
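
As a rough sketch of the whole round trip, the snippet below uses random placeholder data in the layout the question describes (4096 features by 180 samples), so it transposes X before calling pca because pca treats rows as observations. The random data, the choice p = 100 and the variable names are assumptions for illustration only.

```matlab
rng(0);                      % reproducible placeholder data
X = randn(4096, 180);        % stand-in for the real data: features x samples
p = 100;                     % number of components to keep (assumed)

% One observation per row, so pass the 180 x 4096 transpose.
% mu is the per-feature mean that pca subtracted before the decomposition.
[coeff, pca_scores, ~, ~, var_explained, mu] = pca(X.', 'NumComponents', p);

disp(size(pca_scores));      % 180 x p: one reduced row per sample
disp(size(coeff));           % 4096 x p: one column per principal direction

% Approximate reconstruction, transposed back to the 4096 x 180 layout.
X_rec = bsxfun(@plus, pca_scores * coeff.', mu).';
disp(size(X_rec));
```

The reduced data for classification is pca_scores; X_rec is only needed if you want to see how much information the first p components retain.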

To answer your questions:

  1. How to obtain the reduced dimension? In the above example, pca_scores is your reduced-dimension data.
  2. When I set n to 200, I got an error about the matrix dimensions being exceeded, which led me to assume that the reduced dimension cannot be larger than the sample size. Is this true? You can't use 200. With 180 samples, pca returns at most 179 components, so the reduced dimension has to be less than 180.
  3. How to find the right number of reduced dimensions? You can make this decision by checking the var_explained vector: take its cumulative sum and keep the smallest number of components that reaches your target. Typically you want to retain about 99% of the variance of the features. You can read more about this here.
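
One possible way to turn point 3 into code is sketched below, again on random placeholder data in the question's 4096-by-180 layout. The 99% target, the X_new samples and the variable names are assumptions for illustration, not part of the original answer.

```matlab
rng(0);
X = randn(4096, 180);                     % placeholder data: features x samples

% Run pca without truncation to see every component's share of the variance.
[coeff, scores, ~, ~, var_explained, mu] = pca(X.');

target = 99;                              % percent of variance to retain (assumed)
cum_var = cumsum(var_explained);          % cumulative percentage per component
p = find(cum_var >= target, 1);           % smallest p that reaches the target

X_reduced = scores(:, 1:p);               % 180 x p, the input for the classifier

% Unseen samples must be centered with the same mu and projected
% onto the same directions before they go to the classifier.
X_new = randn(4096, 5);                   % hypothetical new samples
X_new_reduced = bsxfun(@minus, X_new.', mu) * coeff(:, 1:p);   % 5 x p
```

On real data with correlated features, p is usually much smaller than on this random placeholder, where the variance is spread almost evenly over the components.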