
I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.

After removing the labels from the training data, I append each row of the CSV to a list like this:

import numpy as np

for row in csv:
    train_data.append(np.array(row, dtype=np.int64))

I do the same for the test data.

I pre-process this data with PCA to perform dimensionality reduction (and feature extraction?):

from sklearn import decomposition

def preprocess(train_data, test_data, pca_components=100):
    # convert the list of rows to a matrix
    train_data = np.mat(train_data)

    # fit PCA on the training data only, then reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)

    return (X_train, X_test)

I then create a kNN classifier, fit it on the X_train data, and make predictions on the X_test data.
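Roughly, that step looks like this (a minimal sketch; the n_neighbors value and the y_train label variable are illustrative, not taken from my actual code):

from sklearn.neighbors import KNeighborsClassifier

# hypothetical sketch of the classification step; y_train is assumed to hold
# the labels that were removed from the training data earlier
X_train, X_test = preprocess(train_data, test_data, pca_components=100)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)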

Using this method I can get around 97% accuracy.

My question is about the dimensionality of the data before and after PCA is performed:

What are the dimensions of train_data and X_train?

How does the number of components influence the dimensionality of the output? Are they the same thing?


1 Answer


The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one points along the direction with the largest remaining spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
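If you want to convince yourself of this, the principal axes that scikit-learn stores in pca.components_ match the leading eigenvectors of the covariance matrix, up to sign. A minimal sketch with made-up data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# toy data with clearly different spread along each feature
X = rng.randn(500, 5) * [5.0, 3.0, 2.0, 1.0, 0.5]

pca = PCA(n_components=2).fit(X)

# eigen-decomposition of the covariance matrix; eigh returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]].T   # the two leading eigenvectors, one per row

# each principal component equals an eigenvector up to a sign flip
print(np.allclose(np.abs(pca.components_), np.abs(top2)))   # True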

The pca_components parameter tells the algorithm how many of these best basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance in your data.
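You can check how much of the variance those vectors actually account for; after fitting, scikit-learn exposes this as explained_variance_ratio_ (variable names as in your preprocess function):

# after: pca = decomposition.PCA(n_components=pca_components).fit(train_data)
print(pca.explained_variance_ratio_.sum())   # fraction of the total variance kept
print(pca.explained_variance_ratio_[:5])     # fraction explained by each of the first 5 components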

The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example, the 100 best vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As Jaime correctly pointed out in the comments, this is equivalent to projecting the data onto the new basis.
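You can verify that equivalence directly: with the default whiten=False, transform is just the mean-centered data multiplied by the component vectors. A small sketch with random data:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)

pca = PCA(n_components=3).fit(X)

# transform == project the centered data onto the chosen basis vectors
projected = (X - pca.mean_) @ pca.components_.T
print(np.allclose(pca.transform(X), projected))   # True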

In the 3D case, if you wanted a basis formed by the first 2 eigenvectors, the 3D point cloud would first be rotated so that the directions of greatest variance align with the coordinate axes. Then the axis along which the variance is smallest is discarded, leaving you with 2D data.
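As a concrete toy version of that picture, here is a made-up 3D cloud with almost no spread along one axis, reduced to 2D:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 3D cloud: large spread in x, some in y, almost none in z
points = rng.randn(1000, 3) * [10.0, 3.0, 0.1]

flattened = PCA(n_components=2).fit_transform(points)
print(points.shape)      # (1000, 3)
print(flattened.shape)   # (1000, 2)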

So, to answer your question directly: yes, the number of requested PCA components is the dimensionality of the output data (after the transformation).
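In your case (assuming the usual Kaggle digit data, where each image has 28x28 = 784 pixel columns) that means:

# shapes assuming 784 pixel columns per image; n_samples is the number of rows you loaded
print(np.mat(train_data).shape)   # (n_samples, 784) -- before PCA
print(X_train.shape)              # (n_samples, 100) -- after PCA with pca_components=100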