I have 4 classes of images (72 .png files per class) and I am carrying out PCA on them in order to find the two components that explain the most variance in the data. Here's the code:
import numpy as np
from PIL import Image

data_list = []
for file in fileList:  # fileList contains the names of the 72*4 .png files
    img_data = np.asarray(Image.open('C:\\Users\\Gian\\Desktop\\UNI\\'
                                     'Sapienza\\Machine Learning\\Homeworks\\'
                                     'First\\Data\\samples\\' + file))  # open image
    x = img_data.ravel()  # flatten the image into a 49152-length vector
    data_list.append(x)
X = np.array(data_list)  # data matrix (72*4 = 288 rows by 49152 columns)
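For reference, the same loading pattern with random arrays standing in for the actual .png files gives the matrix shape I expect (the 128x128 RGB size here is an assumption; it just matches 128*128*3 = 49152):

```python
import numpy as np

# Stand-in for the real loop: 288 fake "images" instead of Image.open(...),
# each flattening to a 49152-length vector.
data_list = []
for _ in range(288):
    img_data = np.random.rand(128, 128, 3)  # placeholder for a real image
    data_list.append(img_data.ravel())

X = np.array(data_list)
print(X.shape)  # (288, 49152)
```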
Now, at this point I just use scikit-learn, applying the fit_transform function to the data matrix:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the data onto the new subspace and plot, one color per class
X_t = PCA(2).fit_transform(X)
plt.scatter(X_t[0:72, 0], X_t[0:72, 1], c='y')
plt.scatter(X_t[72:144, 0], X_t[72:144, 1], c='m')
plt.scatter(X_t[144:216, 0], X_t[144:216, 1], c='r')
plt.scatter(X_t[216:288, 0], X_t[216:288, 1], c='g')
plt.show()
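As an extra sanity check on this step, the fitted PCA object also exposes explained_variance_ratio_, which reports the fraction of variance each component captures. A minimal sketch on random data (the 288x100 shape is just a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the real data matrix.
X = np.random.rand(288, 100)

pca = PCA(2)
X_t = pca.fit_transform(X)
print(X_t.shape)                      # (288, 2): one row per sample
print(pca.explained_variance_ratio_)  # fraction of variance per component
```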
I get this graph:
which seems good. The classes are pretty distinct, although I think the graph may be mirrored with respect to the y-axis (not sure about this). So I decide to double-check the results. I use numpy to compute the covariance matrix of X, then its eigenvalues and eigenvectors:
covX = np.cov(X)
eig_val, eig_vec = np.linalg.eig(covX)

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))]

# Sort the tuples from highest eigenvalue to lowest
eig_pairs.sort(key=lambda x: x[0], reverse=True)
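Just to convince myself the sorting does what I expect, the same pattern on a toy 2x2 symmetric matrix (made up for illustration):

```python
import numpy as np

# Toy symmetric matrix with known eigenvalues 2 and 5.
A = np.array([[2.0, 0.0],
              [0.0, 5.0]])
eig_val, eig_vec = np.linalg.eig(A)

eig_pairs = [(np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)

print([float(p[0]) for p in eig_pairs])  # [5.0, 2.0] -- largest first
```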
I choose the two eigenvectors with the highest eigenvalues (the two principal components) and stack them into a matrix:
matrix_w = np.hstack((eig_pairs[0][1].reshape(288,1), eig_pairs[1][1].reshape(288,1))) # eigenvectors along columns
Now, here's the problem. I have to transform the data matrix onto the new subspace. So I just compute:
standX = (X - np.mean(X)) / np.std(X)  # standardized data matrix (zero mean, unit variance)
X_t = matrix_w.T.dot(standX)           # transformed data matrix
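To make the shapes concrete, here is the same sequence of operations on random data of the same size (random arrays standing in for my actual images; sorting by eigenvalue is omitted since it doesn't affect the shapes), printing the shape after each step:

```python
import numpy as np

# Same pipeline on random data of the same shape, tracing shapes.
X = np.random.rand(288, 49152)

covX = np.cov(X)
print(covX.shape)  # (288, 288)

eig_val, eig_vec = np.linalg.eig(covX)
matrix_w = np.hstack((eig_vec[:, 0].reshape(288, 1),
                      eig_vec[:, 1].reshape(288, 1)))
print(matrix_w.shape)  # (288, 2)

standX = (X - np.mean(X)) / np.std(X)
X_t = matrix_w.T.dot(standX)
print(X_t.shape)  # (2, 49152)
```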
This is what I am not understanding: when I used fit_transform, I got a transformed matrix X_t with 288 (= number of images) rows and 2 columns. Now my X_t is 2 rows by 49152 columns! Why are they different? It feels like I should be getting a 288x2 X_t, as I indeed did with fit_transform, but I don't see how.
Also, how can standX (288 rows by 49152 columns) even be multiplied by the eigenvector matrix matrix_w.T (which is 2 rows by 288 columns)? There's something wrong in my code, but I don't understand what exactly.