
I'm trying to train a linear regression model. With GridSearchCV I want to investigate how the model performs with different numbers of dimensions after PCA. I also found an sklearn tutorial which does pretty much the same thing.

But first, my code:

import pandas as pd
import sklearn.linear_model as skl_linear_model
import sklearn.pipeline as skl_pipeline
import sklearn.model_selection as skl_model_selection
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

model_lr = skl_linear_model.LinearRegression()

pca_lr = PCA()

# Pipeline: scale -> PCA -> linear regression
pipeline = skl_pipeline.Pipeline([
            ('standardize', StandardScaler()),
            ('reduce_dim', pca_lr),
            ('regressor', model_lr)])

# Grid: try every possible number of principal components (1 up to all 7)
n_components = list(range(1, len(X_train.columns)+1))
param_grid_lr = {'reduce_dim__n_components': n_components}

estimator_lr = skl_model_selection.GridSearchCV(
                pipeline,
                param_grid_lr,
                scoring='neg_root_mean_squared_error',
                n_jobs=2,
                cv=skl_model_selection.KFold(n_splits=25, shuffle=False, random_state=None),
                error_score=0,
                verbose=1,
                refit=True)

estimator_lr.fit(X_train, y_train)
grid_results_lr = pd.DataFrame(estimator_lr.cv_results_)

By the way, my training data are measurements in different units, in the shape of an 8548x7 array. The code seems to work so far; the attached plot shows the cv_results. Given the complexity of the problem, the result is OK for linear regression (I'm also using other models, which perform better).
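(For reference, the plot of the cv_results can be reproduced with something like the following; the column names are the standard ones GridSearchCV puts into cv_results_, and the scores are negated because of the neg_root_mean_squared_error scoring.)

import matplotlib.pyplot as plt

# Mean RMSE over the 25 CV folds for each candidate number of components
plt.plot(grid_results_lr['param_reduce_dim__n_components'],
         -grid_results_lr['mean_test_score'],
         marker='o')
plt.xlabel('n_components kept by PCA')
plt.ylabel('mean CV RMSE')
plt.show()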

If I understand this correctly, the plot shows that principal components 1 and 2 should explain most of the data, since with those two the loss is almost minimized. Adding more principal components doesn't really improve the result, so their contribution to the explained variance is probably rather low.

To verify this, I did a PCA manually, and this is where the confusion kicks in:

X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA()

# Column labels PC1 ... PC7
PC_list = ['PC' + str(i) for i in range(1, len(X_train.columns) + 1)]

# Principal component scores of the scaled training data
PC_df = pd.DataFrame(data=pca.fit_transform(X_train_scaled), columns=PC_list)

# Loadings: contribution of each original feature to each component
PC_loadings_df = pd.DataFrame(pca.components_.T,
                              columns=PC_list,
                              index=X_train.columns.values.tolist())

# Fraction of the total variance explained by each component
PC_var_df = pd.DataFrame(data=pca.explained_variance_ratio_,
                         columns=['explained_var'],
                         index=PC_list)

That's the resulting explained variance ratio: PC 1 and 2 together explain only around 50 % of the variance.

This seemed a little unexpected, so I checked the tutorial I mentioned at the beginning. And unless I'm overlooking something, the person was doing pretty much the same thing, except for one difference:

When fitting the PCA they did not scale the data, even though they used a StandardScaler in their pipeline. Still, the results they get look just fine.

So I tried the same, and without standardization the explained variance ratio looks quite different: PC 1 and 2 now explain over 90 % of the variance. That would seem to explain my cv_results much better.
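("The same without standardization" here just means fitting another PCA on the raw, unscaled training data:)

# Same PCA as above, but fitted on the unscaled training data
pca_unscaled = PCA()

PC_var_unscaled_df = pd.DataFrame(data=pca_unscaled.fit(X_train).explained_variance_ratio_,
                                  columns=['explained_var'],
                                  index=PC_list)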

But I'm wondering why they didn't scale the data before the PCA. Everything I find about PCA says that the input needs to be standardized, and this makes sense, since my data are measurements in different units.

So what am I missing? Is my initial approach actually correct and am I just misinterpreting the results (I'm new to PCA)? Is it possible that the first two principal components almost minimize the loss even though they explain only around 50 % of the variance? Or could it even be that the PCA in the pipeline does not actually receive scaled data, which is why the CV results line up better with the non-standardized manual PCA?
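(One way I can think of to check that last point is to inspect the refitted best estimator, for example:)

# The PCA step inside the refitted best pipeline (refit=True). Since it sits
# after the StandardScaler step, its explained variance ratio should match the
# first n_components_ entries of the manual PCA on X_train_scaled.
pca_in_pipeline = estimator_lr.best_estimator_.named_steps['reduce_dim']
print(pca_in_pipeline.explained_variance_ratio_)
print(pca.explained_variance_ratio_[:pca_in_pipeline.n_components_])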


1 Answer


Firstly, let me commend you for this excellently phrased question. I wish all newcomers were this thorough.

To the question: I did not check the correctness of the code, but only read the text and looked at the graphs. I will assume your analysis is correct.

I will only attempt to address

But I'm wondering why they didn't scale the data before PCA

and I advise you to take this with a grain of salt: I thought about this same question a while back, and this is what I came up with. I have no reference for the following.


When should you, or shouldn't you, scale the data?

You should scale the data if

  1. Your data are measurements of different units
  2. Your columns are of completely different scales (thus obviously one will dominate the variance; see the toy sketch further below)
  3. Your data are measurements of different sensors

You should not scale the data if

  1. Your data are different dimensions of the same measurement, such as 3D points, because you WANT (for example) the x axis to dominate the variance if all axes are on the same scale
  2. Your data are measurements of the same multi-dimensional sensor, such as an image

It seems the last point is the case in the tutorial: the 8x8 digit images are really a 64-channel sensor, and each pixel is already normalized by that sensor (since the dataset is assumed to be clean, I believe).
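To make point 2 of the first list concrete, here is a toy sketch with made-up two-column data, where one column is measured in much larger units than the other:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two equally informative columns, the second in units roughly 1000x larger
X = np.column_stack([rng.normal(size=500),
                     rng.normal(size=500) * 1000])

print(PCA().fit(X).explained_variance_ratio_)
# without scaling: roughly [0.999999, 0.000001], the large-unit column dominates

print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)
# with scaling: roughly [0.5, 0.5], both columns contribute about equally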

PCA won't work if

  1. Your data are of different (but constant) scales, and you want to keep the absolute differences in the data
  2. Your data have a variance of scales

It is not hard to find examples where PCA doesn't work. It is only a linear model, after all.


This doesn't say what you should do with your own 8548x7 data. Just going by the shape and your description (measurements in different units), I am assuming you should normalize in that case.

I hope this gives some inspiration for further thinking.