
I am using scikit-learn PCA to find the principal components of a dataset with about 20000 features and 400+ samples.

However, when I compare the results with Orange3's PCA, which should use scikit-learn's PCA internally, I get different results, even with the normalization option in the Orange3 PCA widget unchecked.

With scikit-learn, the first principal component accounts for ~14% of the total variance, the second for ~13%, and so on.

With Orange3 I get a very different result (~65% of the variance for the first principal component, and so on):

[Screenshot: Orange3 PCA output]

My code using scikit-learn is the following:

import pandas as pd
from sklearn.decomposition import PCA

# Load the tab-separated matrix (features on rows, samples on columns)
matrix = pd.read_table("matrix.csv", sep='\t', index_col=0)

# Transpose so samples are rows, then fit PCA keeping all components
sk_pca = PCA(n_components=None)
result = sk_pca.fit(matrix.T.values)
print(result.explained_variance_ratio_)

With Orange3, I loaded the CSV using the File widget, then connected it to the PCA widget, in which I unchecked the normalization option.

What causes the difference between the two methods?


3 Answers


This probably has something to do with Orange's PCA preprocessors, or with the way you load the data. Orange's PCA applies the following two preprocessors:

  • continuization (turning categorical, or determined-to-be-categorical, values into continuous ones, e.g. via a one-hot transform), and
  • imputation (replacing NaNs with mean values, for example).

Ensure you load your data without any NaN values and with Orange's three-line header, marking all the features as continuous so that no transformations are applied; a quick check is sketched below.
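As a sanity check (a minimal sketch using Orange 3's scripting API; the file name matrix.tab is an assumption), you can load the file the way the File widget does and confirm that neither preprocessor has anything to do:

import numpy as np
import Orange

# Load the data as the File widget would
data = Orange.data.Table("matrix.tab")

# Imputation only kicks in if there are missing values
print("any NaNs:", np.isnan(data.X).any())

# Continuization only kicks in if some features are not continuous
print("all continuous:", all(var.is_continuous for var in data.domain.attributes))

If both checks come back clean, the preprocessors leave the data untouched.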


Thanks to K3---rnc's answer, I inspected how I loaded the data.

The data were loaded correctly, and there were no missing values. The problem was that Orange3 reads the data with the features on the columns and the samples on the rows, which is the opposite of what I was expecting.

So I transposed the data, and the result now matches the one given by the scikit-learn module:

[Screenshot: corrected PCA output]
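For reference, one way to write a transposed copy of the file for Orange's File widget (a sketch; the file names are assumptions):

import pandas as pd

# The original file has features on rows and samples on columns
matrix = pd.read_table("matrix.csv", sep='\t', index_col=0)

# Write a transposed copy (samples on rows, features on columns)
matrix.T.to_csv("matrix_transposed.csv", sep='\t')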

Thanks


Maybe the difference is due to the normalization. scikit-learn's preprocessing.scale divides by the standard deviation computed with n (ddof=0), while pandas' std divides by n-1 (ddof=1). This can explain small differences when the sample is small.

You can check it yourself. For the standardization by hand with pandas (std with n-1):

import numpy as np
import pandas as pd

# df is the raw data matrix (one variable per column)

# Compute the mean of each variable
df2 = df.mean()
df2 = pd.DataFrame(df2, columns=['Mean'])

# Compute the standard deviation of each variable (pandas uses ddof=1, i.e. n-1)
df3 = df.std()
df3 = pd.DataFrame(df3, columns=['Standard Deviation'])

# Center the matrix: subtract each variable's mean from df
df4 = df.sub(df.mean(axis=0), axis=1)

# Scale the matrix: divide the centered matrix by its standard deviation
df5 = df4.divide(df.std(axis=0), axis=1)
df5 = df5.replace(np.nan, 0)

Compare with scikit-learn's preprocessing.scale, which standardizes using the population std (n):

from sklearn import preprocessing

# preprocessing.scale standardizes with ddof=0 (divides by n)
df = pd.DataFrame(preprocessing.scale(df), index=df.index, columns=df.columns)
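A quick way to see the two conventions side by side (a minimal sketch; the values in the comments are rounded):

import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas uses the sample standard deviation, ddof=1 (divides by n-1)
print(x.std())            # ~1.291

# numpy (and sklearn's preprocessing.scale) use ddof=0 (divides by n)
print(np.std(x.values))   # ~1.118

With only four values the two results already differ by about 15%; the gap shrinks as the sample grows.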