I have a dataset of 50K rows and 26 features. I'm normalizing the columns using sklearn's StandardScaler (so each column has mean 0 and standard deviation 1), then running a PCA that keeps enough components to retain ~90% of the original variance. I then normalize the rows before running sklearn's KMeans algorithm.
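
In code, the pipeline looks roughly like this (a minimal sketch; the random placeholder data and the KMeans parameters such as n_clusters=8 are just stand-ins):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder for the real 50K x 26 dataset
X = np.random.rand(50_000, 26)

# 1. Column-wise standardization: each feature gets mean 0, std 1
X_scaled = StandardScaler().fit_transform(X)

# 2. PCA keeping enough components to explain ~90% of the variance
X_pca = PCA(n_components=0.90).fit_transform(X_scaled)

# 3. Row-wise normalization: scale each row to unit L2 norm
X_rows = normalize(X_pca, norm="l2")

# 4. Cluster the row-normalized, PCA-reduced data
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_rows)
```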

Is there any reason I shouldn't normalize the rows after running the PCA? If there is, would normalizing the rows before the PCA cause any issues, and should that be done before or after normalizing the columns?

The reason for normalizing the rows is to remove the 'magnitude' or 'skill level' from each row and instead look at the relationships between the respective PCA-reduced features.
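
For example, two rows with the same profile but different overall magnitude collapse to the same unit vector after row normalization, which is the effect I'm after (toy example):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two rows with the same "shape" but different overall magnitude
a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[10.0, 20.0, 30.0]])

print(normalize(a))  # [[0.267 0.535 0.802]]
print(normalize(b))  # [[0.267 0.535 0.802]] -- same direction, magnitude removed
```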


1 Answer


This is very dependent on the data. Since I don't know what shape these "skill level" numbers take, I'm hesitant to give a direct answer. For instance, is it reasonable for some rows to have several standardized scores outside the [-1, 1] range while others have values of small magnitude? It sounds like this is the case you're trying to address.

I worry that you'll have a lot of rows with several values in the 1-2 range (positive or negative), but also some rows with perhaps a single meaningful value and the rest of the entries near 0. When you row-normalize such a "one-hot" row, that lone value gets stretched to carry the entire row norm, which can mean inflating it by a factor of 10 or more if it started out small. Do you want that row clustered as an outlier, or included in the central region of the space? Is someone with a single better-than-mediocre trait an outlier for this data?
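
To make that concrete, here's a small hypothetical example, assuming unit-L2 row normalization (what sklearn's normalize does by default); the numbers are made up:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical rows in the standardized / PCA-reduced space
typical = np.array([[1.5, -1.2, 0.8, 1.1, -0.9, 0.2]])  # several moderate values
one_hot = np.array([[0.05, 0.0, 0.0, 0.0, 0.0, 0.0]])   # one small value, rest ~0

print(normalize(typical))
# -> roughly [[ 0.59 -0.47  0.32  0.44 -0.36  0.08]]: values shrink a bit
print(normalize(one_hot))
# -> [[1. 0. 0. 0. 0. 0.]]: the lone 0.05 is inflated by a factor of 20,
#    and the row is pushed to a "corner" of the unit sphere, far from typical rows
```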

There's nothing wrong with re-normalizing the rows after a PCA. However, if you normalize them both before and after, you won't see much change, since the PCA kept the large majority of the variance and removed only the components that were mostly redundant.
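
If you want to check how much the ordering matters on your own data rather than take my word for it, one quick sanity check is to run both orderings and compare the resulting cluster assignments (a rough sketch; the placeholder data and n_clusters=8 are assumptions, so swap in your own):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Placeholder for your column-standardized data
X_scaled = StandardScaler().fit_transform(np.random.rand(50_000, 26))

# Ordering A: PCA first, then row normalization
a = normalize(PCA(n_components=0.90).fit_transform(X_scaled))

# Ordering B: row normalization first, then PCA
b = PCA(n_components=0.90).fit_transform(normalize(X_scaled))

labels_a = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(a)
labels_b = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(b)

# 1.0 means the two clusterings agree exactly (up to label permutation)
print(adjusted_rand_score(labels_a, labels_b))
```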