I read the following article:
Anomaly detection with Principal Component Analysis (PCA)
The article states the following:
• The PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.
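To make the three points above concrete, here is a minimal sketch of how I understand them, written in plain NumPy rather than Spark. The toy data, the variable names, and the choice to keep all components are my own assumptions and are not taken from the article. It centers the readings, projects them onto the principal axes, and computes the Mahalanobis distance of each reading to the center. In the PCA coordinates the covariance matrix is diagonal, so the squared distance reduces to the sum of squared scores, each divided by the variance (eigenvalue) of its component.

```python
import numpy as np

# Toy data, invented for illustration: 200 correlated 2-D readings
# plus one reading that breaks the correlation pattern (row index 200).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[5.0, 10.0],
                            cov=[[2.0, 1.5], [1.5, 2.0]],
                            size=200)
X = np.vstack([X, [12.0, 3.0]])

# Center the readings: the mean becomes the origin of the new coordinate system.
mean = X.mean(axis=0)
Xc = X - mean

# PCA via eigendecomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are the principal axes

# PCA scores: coordinates of every reading in the new system.
scores = Xc @ eigvecs

# Squared Mahalanobis distance in PCA space. The scores are uncorrelated,
# so the covariance is diagonal and only a per-component rescaling is left.
d2_pca = np.sum(scores ** 2 / eigvals, axis=1)

# The same distance computed directly in the original coordinates, for comparison.
d2_direct = np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(cov), Xc)

anomaly_score = np.sqrt(d2_pca)
most_anomalous = int(np.argmax(anomaly_score))
print("same distance in both coordinate systems:", np.allclose(d2_pca, d2_direct))
print("most anomalous row:", most_anomalous)
```

If this reading is right, the Mahalanobis distance still does something useful after the correlation is removed: it rescales each component by its variance, so a deviation along a low-variance direction counts for more than the same deviation along a high-variance one.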
Can anyone describe in more detail how anomaly detection with PCA works (using PCA scores and the Mahalanobis distance)? I'm confused because the definition of PCA is: "PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables". How can the Mahalanobis distance be used when there is no longer any correlation between the variables?
Can anybody explain how to do this in Spark? Does the pca.transform function return the scores, so that I should then calculate the Mahalanobis distance from every reading to the center?