I have 4 classes of images (72 .png files per class) and I am carrying out PCA on them in order to find the two components that explain the most variance in the data. Here's the code:
import numpy as np
from PIL import Image

data_list = []
for file in fileList:  # fileList contains the names of the 72*4 .png files
    img_data = np.asarray(Image.open('C:\\Users\\Gian\\Desktop\\UNI\\'
                                     'Sapienza\\Machine Learning\\Homeworks\\'
                                     'First\\Data\\samples\\' + file))  # open image
    x = img_data.ravel()  # flatten the image into a 49152-length vector
    data_list.append(x)
X = np.array(data_list)  # data matrix (72*4 = 288 rows by 49152 columns)
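For reference, the same loading pattern with random arrays standing in for the actual .png files gives the matrix shape I expect (the 128x128 RGB size here is an assumption; it just matches 128*128*3 = 49152):

```python
import numpy as np

# Stand-in for the real loop: 288 fake "images" instead of Image.open(...),
# each flattening to a 49152-length vector.
data_list = []
for _ in range(288):
    img_data = np.random.rand(128, 128, 3)  # placeholder for a real image
    data_list.append(img_data.ravel())

X = np.array(data_list)
print(X.shape)  # (288, 49152)
```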
Now, at this point I just use scikit-learn, applying the fit_transform function to the data matrix:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the data onto the new subspace and plot, one color per class
X_t = PCA(2).fit_transform(X)
plt.scatter(X_t[0:72, 0], X_t[0:72, 1], c='y')
plt.scatter(X_t[72:144, 0], X_t[72:144, 1], c='m')
plt.scatter(X_t[144:216, 0], X_t[144:216, 1], c='r')
plt.scatter(X_t[216:288, 0], X_t[216:288, 1], c='g')
plt.show()
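As an extra sanity check on this step, the fitted PCA object also exposes explained_variance_ratio_, which reports the fraction of variance each component captures. A minimal sketch on random data (the 288x100 shape is just a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the real data matrix.
X = np.random.rand(288, 100)

pca = PCA(2)
X_t = pca.fit_transform(X)
print(X_t.shape)                      # (288, 2): one row per sample
print(pca.explained_variance_ratio_)  # fraction of variance per component
```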
I get this graph:
which seems good. The classes are pretty distinct, although I think the graph may be mirrored with respect to the y-axis (not sure about this). So I decide to double-check the results. I use numpy to compute the covariance matrix of X, then its eigenvalues and eigenvectors:
covX = np.cov(X)
eig_val, eig_vec = np.linalg.eig(covX)

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))]

# Sort the tuples from highest eigenvalue to lowest
eig_pairs.sort(key=lambda x: x[0], reverse=True)
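Just to convince myself the sorting does what I expect, the same pattern on a toy 2x2 symmetric matrix (made up for illustration):

```python
import numpy as np

# Toy symmetric matrix with known eigenvalues 2 and 5.
A = np.array([[2.0, 0.0],
              [0.0, 5.0]])
eig_val, eig_vec = np.linalg.eig(A)

eig_pairs = [(np.abs(eig_val[i]), eig_vec[:, i]) for i in range(len(eig_val))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)

print([float(p[0]) for p in eig_pairs])  # [5.0, 2.0] -- largest first
```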
I choose the two eigenvectors with the highest eigenvalues (the two principal components) and stack them into a matrix:
matrix_w = np.hstack((eig_pairs[0][1].reshape(288,1), eig_pairs[1][1].reshape(288,1))) # eigenvectors along columns
Now, here's the problem. I have to transform the data matrix onto the new subspace. So I just compute:
standX = (X - np.mean(X)) / np.std(X)  # standardized data matrix (zero mean, unit variance)
X_t = matrix_w.T.dot(standX)           # transformed data matrix
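To make the shapes concrete, here is the same sequence of operations on random data of the same size (random arrays standing in for my actual images; sorting by eigenvalue is omitted since it doesn't affect the shapes), printing the shape after each step:

```python
import numpy as np

# Same pipeline on random data of the same shape, tracing shapes.
X = np.random.rand(288, 49152)

covX = np.cov(X)
print(covX.shape)  # (288, 288)

eig_val, eig_vec = np.linalg.eig(covX)
matrix_w = np.hstack((eig_vec[:, 0].reshape(288, 1),
                      eig_vec[:, 1].reshape(288, 1)))
print(matrix_w.shape)  # (288, 2)

standX = (X - np.mean(X)) / np.std(X)
X_t = matrix_w.T.dot(standX)
print(X_t.shape)  # (2, 49152)
```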
This is what I am not understanding: when I used fit_transform, I got a transformed matrix X_t with 288 (= number of images) rows and 2 columns. Now my X_t is 2 rows by 49152 columns! Why are they different? It feels like I should be getting a 288x2 X_t, as I indeed did with fit_transform, but I don't see how.
Also, how can standX (288 rows by 49152 columns) even be multiplied by the eigenvector matrix matrix_w.T (which is 2 rows by 288 columns)? There's something wrong in my code, but I don't understand what exactly.