
I am using scikit-learn PCA to find the principal components of a dataset with about 20000 features and 400+ samples.

However, when I compare the results with Orange3's PCA, which should use scikit-learn's PCA internally, I get different results, even with the normalization option in the Orange3 PCA widget unchecked.

With scikit-learn, the first principal component accounts for ~14% of the total variance, the second for ~13%, and so on.

With Orange3 I get a very different result (~65% of the variance for the first principal component, and so on):

[Screenshot: Orange3 PCA output]

My code using scikit-learn is the following:

import pandas as pd
from sklearn.decomposition import PCA

# Load the tab-separated matrix (features on rows, samples on columns)
matrix = pd.read_table("matrix.csv", sep='\t', index_col=0)

# Transpose so samples are rows, then fit PCA keeping all components
sk_pca = PCA(n_components=None)
result = sk_pca.fit(matrix.T.values)
print(result.explained_variance_ratio_)

With Orange3, I loaded the CSV using the File widget, then connected it to the PCA widget, in which I unchecked the normalization option.

What causes the difference between the two methods?


3 Answers


This probably has something to do with Orange's PCA preprocessors, or with the way you load the data. Orange's PCA applies the following two preprocessors:

  • continuization (turning categorical, or determined-to-be-categorical, values into continuous ones, e.g. via a one-hot transform), and
  • imputation (replacing NaNs with mean values, for example).

Ensure you load your data without any NaN values and with Orange's three-line header, marking all the features as continuous so that no transformations are applied; a quick check is sketched below.
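As a sanity check (a minimal sketch using Orange 3's scripting API; the file name matrix.tab is an assumption), you can load the file the way the File widget does and confirm that neither preprocessor has anything to do:

import numpy as np
import Orange

# Load the data as the File widget would
data = Orange.data.Table("matrix.tab")

# Imputation only kicks in if there are missing values
print("any NaNs:", np.isnan(data.X).any())

# Continuization only kicks in if some features are not continuous
print("all continuous:", all(var.is_continuous for var in data.domain.attributes))

If both checks come back clean, the preprocessors leave the data untouched.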


Thanks to K3---rnc's answer, I inspected how I loaded the data.

The data were loaded correctly, and there were no missing values. The problem was that Orange3 reads the data with the features on the columns and the samples on the rows, which is the opposite of what I was expecting.

So I transposed the data, and the result now matches the one given by the scikit-learn module:

[Screenshot: corrected PCA output]
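For reference, one way to write a transposed copy of the file for Orange's File widget (a sketch; the file names are assumptions):

import pandas as pd

# The original file has features on rows and samples on columns
matrix = pd.read_table("matrix.csv", sep='\t', index_col=0)

# Write a transposed copy (samples on rows, features on columns)
matrix.T.to_csv("matrix_transposed.csv", sep='\t')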

Thanks


Maybe the difference is due to the normalization. scikit-learn's preprocessing.scale divides by the standard deviation computed with n (ddof=0), while pandas' std divides by n-1 (ddof=1). This can explain small differences when the sample is small.

You can check it yourself. For the standardization by hand with pandas (std with n-1):

import numpy as np
import pandas as pd

# df is the raw data matrix (one variable per column)

# Compute the mean of each variable
df2 = df.mean()
df2 = pd.DataFrame(df2, columns=['Mean'])

# Compute the standard deviation of each variable (pandas uses ddof=1, i.e. n-1)
df3 = df.std()
df3 = pd.DataFrame(df3, columns=['Standard Deviation'])

# Center the matrix: subtract each variable's mean from df
df4 = df.sub(df.mean(axis=0), axis=1)

# Scale the matrix: divide the centered matrix by its standard deviation
df5 = df4.divide(df.std(axis=0), axis=1)
df5 = df5.replace(np.nan, 0)

Compare with scikit-learn's preprocessing.scale, which standardizes using the population std (n):

from sklearn import preprocessing

# preprocessing.scale standardizes with ddof=0 (divides by n)
df = pd.DataFrame(preprocessing.scale(df), index=df.index, columns=df.columns)
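A quick way to see the two conventions side by side (a minimal sketch; the values in the comments are rounded):

import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas uses the sample standard deviation, ddof=1 (divides by n-1)
print(x.std())            # ~1.291

# numpy (and sklearn's preprocessing.scale) use ddof=0 (divides by n)
print(np.std(x.values))   # ~1.118

With only four values the two results already differ by about 15%; the gap shrinks as the sample grows.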