0
votes

Say we have a high-dimensional dataset that we have reduced to a lower dimension using PCA. Would it be wise/accurate to then run a clustering algorithm on the reduced data, assuming we do not know how many clusters to expect?

Using PCA on the Iris dataset (with the rows of the csv ordered so that all samples of the first class come first, then the second, then the third) yields the following plot:

[Figure: Ordered data run through PCA]

It can be seen that the three classes in the Iris dataset have been retained. However, when the order of the samples is randomised, the following plot is produced:

[Figure: Unordered data run through PCA]

Above, it is not clear how many clusters/classes the dataset contains. In this case (the more realistic one), how would one identify the number of classes? Would a clustering algorithm such as K-Means be effective?

Would there be inaccuracies due to discarding the lower-order principal components?

EDIT: To be clear, I am asking whether a dataset can be clustered after running PCA and, if so, what the most accurate method would be.

2
What exactly have you plotted? I plotted the PCA of iris a while ago, and on the first two reduced dimensions (containing the most variance) the clusters were visible. - Thomas Jungblut
Hi, I am plotting the product of the first principal component eigenvector and the original (zero-mean) data set. - Jack H
Make a histogram, instead of just plotting the points. - Don Reba
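
To illustrate the histogram suggestion above, here is a minimal sketch. It assumes synthetic data (three made-up groups in 4 dimensions, not the actual Iris csv) and a hypothetical helper `first_pc_scores`. It projects the zero-mean data onto the first principal component, as described in the comments, and shows that the histogram of those scores is the same whether or not the rows are shuffled, whereas a plot against sample index is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real data: three well-separated groups of
# 50 samples in 4 dimensions, stored group by group (like the ordered csv).
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [10, 10, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 4)) for c in centers])


def first_pc_scores(data):
    """Project zero-mean data onto the first principal component."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc = vt[0]
    # The sign of a principal direction is arbitrary; fix it so that the
    # component with the largest magnitude is positive.
    pc = pc * np.sign(pc[np.argmax(np.abs(pc))])
    return centered @ pc


scores_ordered = first_pc_scores(X)
scores_shuffled = first_pc_scores(X[rng.permutation(len(X))])

# Plotted against sample index, the two arrays look very different.
# Their histograms, however, do not depend on the row order.
edges = np.linspace(-12, 12, 25)
hist_ordered, _ = np.histogram(scores_ordered, bins=edges)
hist_shuffled, _ = np.histogram(scores_shuffled, bins=edges)

print(hist_ordered)
print(hist_shuffled)
```

With data like this, the groups show up as separate humps in the histogram regardless of how the samples were ordered in the file.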

2 Answers

1
votes

Say we have a high-dimensional dataset that we have reduced to a lower dimension using PCA. Would it be wise/accurate to then run a clustering algorithm on the reduced data, assuming we do not know how many clusters to expect?

Your data might well separate along a low-variance dimension, which PCA would discard because it keeps only the highest-variance directions. For that reason I would not recommend running PCA prior to clustering.
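
As a rough illustration of that point, here is a sketch on made-up 2-D data (not Iris). The two groups differ only along a low-variance axis, while a second axis carries most of the variance and no group information. The `separation` helper and all the numbers are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two groups that differ only along the y axis (gap of 2),
# while the x axis carries far more variance but no group information.
n = 200
labels = np.repeat([0, 1], n)
x = rng.normal(scale=10.0, size=2 * n)
y = np.where(labels == 0, -1.0, 1.0) + rng.normal(scale=0.1, size=2 * n)
X = np.column_stack([x, y])

# PCA via SVD of the centered data.
centered = X - X.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T  # column 0: first PC, column 1: second PC
explained_ratio = s**2 / np.sum(s**2)


def separation(values):
    """Gap between the two group means, in units of pooled within-group std."""
    a, b = values[labels == 0], values[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled


sep_pc1 = separation(scores[:, 0])
sep_pc2 = separation(scores[:, 1])

print("explained variance ratio:", explained_ratio)
print("separation along PC1:", sep_pc1)
print("separation along PC2:", sep_pc2)
```

Here the first component explains almost all of the variance yet barely separates the groups. Keeping only that component would throw away the one direction in which the clusters are visible.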

Above, it is not clear how many clusters/classes the dataset contains. In this case (the more realistic one), how would one identify the number of classes? Would a clustering algorithm such as K-Means be effective?

There are effective clustering algorithms that do not require prior knowledge of the number of classes, such as Mean Shift and DBSCAN.
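
A minimal sketch of both, assuming scikit-learn is available and using made-up 2-D blobs rather than the Iris data. Neither algorithm is told the number of clusters, but each has its own parameter (`eps` and `min_samples` for DBSCAN, `bandwidth` for Mean Shift). The values below are guesses that suit this toy data and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift

rng = np.random.default_rng(0)

# Hypothetical data: three compact 2-D blobs, 100 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# DBSCAN groups points that are density-connected; label -1 marks noise.
# eps and min_samples are assumptions chosen for this toy data.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_dbscan = len(set(db_labels.tolist()) - {-1})

# Mean Shift moves each point towards the nearest density mode.
# The bandwidth is likewise an assumption chosen for this toy data.
ms_labels = MeanShift(bandwidth=2.0).fit_predict(X)
n_meanshift = len(np.unique(ms_labels))

print("clusters found by DBSCAN:", n_dbscan)
print("clusters found by Mean Shift:", n_meanshift)
```

The number of clusters comes out of the fit instead of going in, but it still depends on the scale parameters, so a poor `eps` or `bandwidth` can merge or split groups.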

0
votes

Try sorting the dataset after PCA, then plotting it.

The iris data set is much too simple to draw any valid conclusions about the behaviour of high-dimensional data or the benefits of PCA.

Plus, "wise" - in which sense? If you want to eat pizza, it is not wise to plot the iris data set.