
I am using the following code to perform PCA on the iris dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

# get iris data to a dataframe: 
from sklearn import datasets
iris = datasets.load_iris() 
varnames = ['SL', 'SW', 'PL', 'PW']
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf['Species'] = [iris.target_names[a] for a in iris.target]

# perform pca: 
from sklearn.decomposition import PCA
model = PCA(n_components=2)
scores = model.fit_transform(irisdf.iloc[:,0:4])
loadings = model.components_

# plot results: 
scoredf = pd.DataFrame(data=scores, columns=['PC1','PC2'])
scoredf['Grp'] = irisdf.Species
sns.lmplot(fit_reg=False, x='PC1', y='PC2', hue='Grp', data=scoredf)  # plot the scores
loadings = loadings.T
for e, pt in enumerate(loadings):
    plt.plot([0,pt[0]], [0,pt[1]], '--b') 
    plt.text(x=pt[0], y=pt[1], s=varnames[e], color='b')
plt.show()

I am getting the following plot:

[biplot produced by the code above]

However, when I compare it with plots from other sites (e.g. http://marcoplebani.com/pca/), my plot does not look correct. The following differences stand out:

  1. Petal length and petal width lines should have similar lengths.
  2. Sepal length line should be closer to petal length and petal width lines rather than closer to sepal width line.
  3. All 4 lines should be on the same side of x-axis.

Why is my plot not correct? Where is the error, and how can it be corrected?

I don't know enough about the technical details of PCA to say for sure, but this might occur because the signs of PCA loadings and scores are arbitrary. Here's a reference: ncbi.nlm.nih.gov/pmc/articles/PMC4792409 – willk
Relevant section from the reference: "Equation (2.1) remains valid if the eigenvectors are multiplied by −1, and so the signs of all loadings (and scores) are arbitrary and only their relative magnitudes and sign patterns are meaningful." – willk
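
A minimal sketch of this sign ambiguity (using the same iris data; the setup is illustrative, not from the post): flipping the sign of one component in both the scores and the loadings leaves the reconstructed data unchanged, so a mirrored biplot is mathematically equivalent.

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
model = PCA(n_components=2)
scores = model.fit_transform(X)
loadings = model.components_  # shape (2, 4)

# Flip the sign of PC1 in both the scores and the loadings;
# the reconstruction of the centered data is unchanged.
flip = np.array([-1, 1])
same = np.allclose(scores @ loadings,
                   (scores * flip) @ (flip[:, None] * loadings))
print(same)  # True: the signs are arbitrary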

1 Answer


It depends on whether or not you scale the variables to unit variance. The "other site" uses scale = TRUE (the R option). If you want to do the same with sklearn, add a StandardScaler step before fitting the model and fit the model on the scaled data, like this:

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance before PCA:
X = StandardScaler().fit_transform(irisdf.iloc[:, 0:4])
scores = model.fit_transform(X)
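
For completeness, here is the whole corrected pipeline (the question's code with only the scaling step added; everything else is unchanged):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf['Species'] = [iris.target_names[a] for a in iris.target]

# standardize, then fit PCA on the scaled data:
X = StandardScaler().fit_transform(irisdf.iloc[:, 0:4])
model = PCA(n_components=2)
scores = model.fit_transform(X)
loadings = model.components_.T

# plot scores coloured by species, plus the loading vectors:
scoredf = pd.DataFrame(data=scores, columns=['PC1', 'PC2'])
scoredf['Grp'] = irisdf.Species
sns.lmplot(fit_reg=False, x='PC1', y='PC2', hue='Grp', data=scoredf)
for e, pt in enumerate(loadings):
    plt.plot([0, pt[0]], [0, pt[1]], '--b')
    plt.text(x=pt[0], y=pt[1], s=varnames[e], color='b')
plt.show()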

[biplot after standardizing the data]

Edit: Difference between StandardScaler and normalize

Here is an answer that points out a key difference (row-wise vs column-wise normalization). Even if you use normalize here, you might want to consider X = normalize(X.T).T, which normalizes each feature (column) rather than each sample (row). The following code shows the differences after each transformation:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize

iris = datasets.load_iris() 
varnames = ['SL', 'SW', 'PL', 'PW']

fig, ax = plt.subplots(2, 2, figsize=(16, 12))

irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf.plot(kind='kde', title='Raw data', ax=ax[0][0])

irisdf_std = pd.DataFrame(data=StandardScaler().fit_transform(irisdf), columns=varnames)
irisdf_std.plot(kind='kde', title='StandardScaler', ax=ax[0][1])

irisdf_norm = pd.DataFrame(data=normalize(irisdf), columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize (rows/samples)', ax=ax[1][0])

irisdf_norm_T = pd.DataFrame(data=normalize(irisdf.T).T, columns=varnames)
irisdf_norm_T.plot(kind='kde', title='normalize (columns/features)', ax=ax[1][1])

plt.show()

[2×2 grid of KDE plots: raw data, StandardScaler, row-wise normalize, column-wise normalize]

I'm not sure how deep I can go into the algorithm/math here. The point of StandardScaler is to give every feature a uniform, consistent mean and variance. The assumption is that variables measured in large units should not automatically dominate the PCA; in other words, StandardScaler makes the features contribute equally to the PCA. As you can see, normalize does not give consistent means or variances.
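
To verify that last point numerically, here is a small sketch (again on the iris data) that prints the column-wise mean and standard deviation after each transformation:

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler, normalize

X = datasets.load_iris().data

X_std = StandardScaler().fit_transform(X)  # per feature: mean 0, std 1
X_norm = normalize(X)                      # per sample: each row scaled to unit norm

# StandardScaler yields mean ~0 and std 1 for every column;
# normalize leaves column means and variances unconstrained.
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))
print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))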