0
votes

I was running PCA on a dataset. To find the optimal number of principal components, I wanted to start with as many components as there are features. However, when I looked at the explained variance ratio, the number of components was not what I expected. The dataset is 200 × 300 (200 samples, 300 features), so I expected to get 300 components and their corresponding variance ratios back, but I got only 200.

Code is here:

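from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data (X_train has shape 200 x 300)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Fit PCA with n_components left at its default,
# then read the explained variance ratio of each component
pca = PCA()
pca.fit(X_train_scaled)
ratios = pca.explained_variance_ratio_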

I just figured out why, so I will answer this question below.

1 Answer

3
votes

This is due to the default behavior of PCA in sklearn. From the documentation:

n_components : int, None or string
    Number of components to keep. If n_components is not set all components are kept: n_components == min(n_samples, n_features)

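Therefore, when a dataset has fewer samples than features, PCA keeps only as many components as there are samples. This is not an arbitrary cap: a data matrix with n_samples rows has rank at most n_samples, so there are no further directions of variance to find. For the 200 × 300 dataset above, that gives min(200, 300) = 200 components rather than 300.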
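
The behavior can be reproduced with a small sketch. The random data here is only a stand-in for the real X_train (an assumption, since the original dataset is not shown), and the 95% threshold at the end is an illustrative choice for picking a number of components, not something taken from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 200 samples, 300 features (assumption)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 300))

X_train_scaled = StandardScaler().fit_transform(X_train)

# n_components is left unset, so PCA keeps min(n_samples, n_features) components
pca = PCA()
pca.fit(X_train_scaled)

n_samples, n_features = X_train_scaled.shape
print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.shape)  # one ratio per kept component

# Cumulative explained variance, one common way to choose how many components to keep
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)  # illustrative 95% threshold
print(n_for_95)
```

Checking pca.n_components_ after fitting is a quick way to see how many components were actually kept, instead of assuming it equals the number of features.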