1
votes

I am using the MATLAB Classification Learner app to compare classifiers on a training set of 700 samples. The response variable is a categorical label with 5 possible values, and I have 7 numerical features and 2 categorical ones. A Cubic SVM gave the highest accuracy (83%), but accuracy drops considerably, to 40.5%, when I enable PCA with 95% explained variance. I am a student and this is the first time I am using PCA.

  1. Why do I see such a result?
  2. Could it be because of a small / unbalanced data set?
  3. When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?

Any help is appreciated. Thanks in advance!

1
PCA assumes Gaussian distributed inputs. It doesn't work for categorical data at all. – Ander Biguri
PCA combines all inputs linearly. This is a mapping to a different space (of the same dimension unless you drop certain dimensions). You may lose or mask certain nonlinear correlations. – max
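
To make the point in the comments concrete, here is a minimal Python/NumPy sketch (not MATLAB, and not the asker's data). It assumes a synthetic two-class, two-feature dataset in which the high-variance feature is pure noise and the low-variance feature carries the label. The helper names `pca_by_variance` and `linear_accuracy` are made up for this illustration. Because PCA never looks at the labels, a 95% explained-variance cutoff keeps only the noisy direction here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 350  # samples per class (assumption: 2 classes, 700 samples in total)

# Feature 0: large variance, unrelated to the label.
# Feature 1: small variance, but its mean differs between the classes.
noise = rng.normal(0.0, 10.0, size=2 * n)
signal = np.concatenate([rng.normal(-1.0, 0.3, n), rng.normal(1.0, 0.3, n)])
X = np.column_stack([noise, signal])
y = np.repeat([0, 1], n)


def pca_by_variance(X, explained=0.95):
    """Project X onto the fewest principal components reaching `explained` variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]  # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, explained) + 1)
    return Xc @ eigvecs[:, :k], k


def linear_accuracy(X, y):
    """Training accuracy of a least-squares linear classifier (a simple stand-in for an SVM)."""
    A = np.column_stack([X, np.ones(len(X))])
    targets = np.where(y == 1, 1.0, -1.0)
    w, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return float(((A @ w > 0).astype(int) == y).mean())


X_pca, k = pca_by_variance(X, explained=0.95)
acc_full = linear_accuracy(X, y)
acc_pca = linear_accuracy(X_pca, y)
print(f"components kept: {k}")
print(f"accuracy on all features: {acc_full:.3f}")
print(f"accuracy after PCA:       {acc_pca:.3f}")
```

On this synthetic data the original features are almost perfectly separable, while the single retained component performs close to chance. Standardizing the features before PCA changes which directions count as high variance, so that is worth checking in a real workflow.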

1 Answer

0
votes

I want to share my opinion.

A training set of size 700 means you have fewer than 1,000 samples.

  1. I'm actually surprised that the SVM reaches 83%.
  • Even the MNIST dataset is considered small (60,000 training and 10,000 test samples). Your data is much, much smaller.

  • With PCA you compress this already small dataset even further (it reduces the number of features, not the number of samples). So what is left for the SVM to learn from? The discriminating information may be gone.

  • If I were you, I would also try a random forest classifier. It might even perform better.

  2. Even if you balance your data, it is still a small dataset.
  • I don't believe using SMOTE will improve the result. If your data consisted of images, you could augment it with something like Keras' ImageDataGenerator, though I'm not sure MATLAB has an equivalent.
  3. PCA is useful when you have lots of features, yet not all of them directly affect the accuracy even though they are all components of the data.
  • For instance, consider handwritten digit classification data.

[Image: handwritten digit sample]

From the image above, can we say that each pixel directly affects the accuracy?

The answer is no. The black background pixels are not important for the accuracy, so we use PCA to remove them.
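
The pixel argument can be sketched in a few lines of Python/NumPy. This is only an illustration under stated assumptions: the "images" are synthetic 8x8 arrays (not real digits) built from three made-up stroke patterns plus a little noise, with a border that is always black. The names `patterns`, `weights`, and `k95` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, side = 500, 8

# Three hypothetical "stroke" patterns; the image border is always black (0).
patterns = rng.random((3, side, side))
patterns[:, 0, :] = patterns[:, -1, :] = 0.0
patterns[:, :, 0] = patterns[:, :, -1] = 0.0

# Each synthetic image is a random mix of the patterns plus small inner-pixel noise.
weights = rng.random((n_samples, 3))
images = np.einsum("nk,kij->nij", weights, patterns)
images[:, 1:-1, 1:-1] += rng.normal(0.0, 0.01, (n_samples, side - 2, side - 2))
X = images.reshape(n_samples, -1)  # 64 pixel features per image

# Border pixels never change, so they have zero variance.
n_constant = int((X.var(axis=0) == 0).sum())

# PCA via SVD of the centered data: squared singular values are proportional
# to the variance captured by each principal component.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / (s**2).sum()
n_informative = int((explained > 1e-12).sum())
k95 = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)

print(f"pixels: {X.shape[1]}, constant pixels: {n_constant}")
print(f"components with non-zero variance: {n_informative}")
print(f"components needed for 95% variance: {k95}")
```

The always-black pixels contribute nothing to any component, and because the remaining pixels are strongly correlated, only a few components are needed to reach 95% of the variance. Note that PCA ranks directions by variance alone, so whether the discarded directions matter for classification still has to be checked against the labels.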

If you want a detailed explanation with a Python example, check out my other answer.