
If I apply PCA to my feature vectors and then do clustering, like the following:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)
  1. The reduced data will be in terms of the PCA components, so after clustering with k-means I get a label for each point in reduced_data. How do I know which sample in the original data each label belongs to?

  2. How should I choose the number of PCA components relative to the number of clusters? Thanks.


1 Answer

  1. PCA reduces the number of dimensions from n (not stated in your question) to n_components = 2. The labels do not change and the rows of the data matrix are not reordered: row i of reduced_data still corresponds to row i of the original data, so you can map the resulting cluster labels directly onto the original samples.
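As a sketch of this row-for-row correspondence, using scikit-learn's digits dataset as a stand-in for your data (which is not shown in the question):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# stand-in data: 1797 samples of 64 features
data, _ = load_digits(return_X_y=True)

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=10, n_init=10).fit(reduced_data)

# row i of reduced_data is the projection of row i of data,
# so kmeans.labels_[i] is the cluster of the i-th original sample
print(data.shape, reduced_data.shape, kmeans.labels_.shape)
```

No bookkeeping is needed: fit_transform preserves the sample order, so kmeans.labels_ indexes the original rows directly.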

  2. The choice of n_components depends on how much of the original variance you want to retain, not on the number of clusters. First, k-means is not robust, so you will have to initialize it multiple times and compare the results for a given n_components. Second, you would want to choose n_components based on the associated eigenvalues, which you can plot (a scree plot). Also note that PCA is sensitive to scaling, so you should consider normalizing the data before applying it. So, to answer your question: the choice of n_components should follow from the variance you want to retain, not from the number of clusters you are aiming for.

Another thought: instead of using k-means, you could use a clustering algorithm that does not require the target number of clusters as input, such as DBSCAN.