Apply PCA on classification data, category wise or on complete dataset?

Question

I have image data for a classification task with 15 different classes, and each class has five feature sets (colour features, SIFT features, etc.). Each class has around 300 instances/samples on average (varying from 200 to 400), and the total number of samples is close to 4500. The dimensions of the feature sets are 512, 1296, 5376, 5376 and 22950.

(For clarity: for one class and one colour feature, say, I have a matrix with 220 rows (samples), where each row is a 5376-dimensional vector. That gives a 220 x 5376 matrix representing one class and one feature set.)

Now, if I apply PCA to each category/class separately, the reduced dimension of every feature set will be less than 270 (n_components = min(n_samples, feature_dimension)).

If I apply PCA to the complete dataset of 4500 images (concatenating all samples from the 15 classes), on one feature set at a time, say colour, then I'll obtain a dataset of reduced dimension less than min(4500, feature_dimension).

What is the most appropriate way to apply PCA: on the data of each category separately (per feature set), or on the complete dataset for one feature set? Note that I need to choose the number of principal components so that they account for more than 90% of the variance.

Happy to receive some help!!

Josep Valls · Accepted Answer · 2016-01-20T04:20:10

I'd recommend experimenting with both approaches. Dump the data into an ARFF file (similar to a CSV with a header) and open it in Weka (http://www.cs.waikato.ac.nz/ml/weka/). You will be able to easily explore different scenarios, visualize the dimensionality reduction, and even try some feature selection algorithms.
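
For readers who would rather script the comparison than use Weka, here is a minimal sketch in Python with NumPy (an assumption; the answer itself only mentions Weka). It computes how many principal components are needed to reach 90% of the variance, once on the pooled data and once per class. The data is synthetic and much smaller than in the question, and the helper name `n_components_for_variance` is hypothetical, introduced only for illustration.

```python
import numpy as np

def n_components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target` (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2                               # proportional to component variances
    ratio = np.cumsum(var) / var.sum()         # cumulative explained-variance ratio
    k = int(np.searchsorted(ratio, target)) + 1
    return min(k, len(s))

# Synthetic stand-in for ONE feature set (sizes are assumptions, not the real data).
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 15, 40, 64
class_means = rng.normal(scale=3.0, size=(n_classes, n_features))
X = np.vstack([m + rng.normal(size=(n_per_class, n_features)) for m in class_means])
y = np.repeat(np.arange(n_classes), n_per_class)

# Approach 1: one PCA on the complete (pooled) dataset.
k_pooled = n_components_for_variance(X)

# Approach 2: a separate PCA for each class.
k_per_class = {c: n_components_for_variance(X[y == c]) for c in range(n_classes)}

print("pooled:", k_pooled)
print("per class:", k_per_class)
```

One thing to keep in mind when comparing the two: a per-class projection can only be applied to a sample whose class is already known, which is not the case for an unseen test sample, whereas a single pooled projection maps every sample into the same space.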
