
Hi, I am trying to apply PCA to a folder containing many images (.jpg). However, I am stuck on converting them to a format that scikit-learn's PCA accepts. It seems that PCA takes data as an array. I have read articles such as "PCA for image data", but they look quite complicated to me. I just want to convert the images to an accepted format and then call pca.fit.

Previously I used os.walk to convert the images to grayscale and resize them (as below). I was wondering if I can use the same approach for PCA as well.

from sklearn.decomposition import PCA
from PIL import Image 
import os
import numpy as np

WORK_DIR = 'D:/folder/' #working folder
source = os.path.join(WORK_DIR, 'train')  
target = os.path.join(WORK_DIR, 'gray')  

for root, dirpath, filenames in os.walk(source):
    for file in filenames:
        image_file = Image.open(os.path.join(root, file))
        image_file.draft('L', (256, 128)) 
        image_file.save(os.path.join(target, file))

Any easier methods would be great too.


1 Answer


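After reading an image you get a 2D array for a grayscale image (or a 3D array for a color one). PCA expects one row per sample, so flatten each image into a 1D vector with .flatten() and collect the vectors in a list. Every image must have the same size and mode, otherwise the rows will have different lengths and the fit will fail. You can then pass this data to pca.fit().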

from sklearn.decomposition import PCA
from PIL import Image 
import os
import numpy as np

WORK_DIR = 'D:/folder/' #working folder
source = os.path.join(WORK_DIR, 'train')  
target = os.path.join(WORK_DIR, 'gray')  

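train_data = []
for root, dirnames, filenames in os.walk(source):
    for file in filenames:
        image_file = os.path.join(root, file)
        print(image_file)
        # Convert to grayscale and a fixed size so every flattened row has the same length
        img = Image.open(image_file).convert('L').resize((256, 128))
        train_data.append(np.array(img).flatten())

# Stack the rows into one 2D array of shape (n_images, 256 * 128)
train_data = np.array(train_data)

pca = PCA()
pca.fit(train_data)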
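
If it helps, the same idea can be wrapped in two small functions. This is only a sketch, not the one right way to do it: the names load_images_as_matrix and fit_pca are made up for illustration, the (256, 128) size is taken from the question, and filtering on the .jpg/.jpeg extension is an assumption about what is in your folder.

```python
import os

import numpy as np
from PIL import Image
from sklearn.decomposition import PCA


def load_images_as_matrix(folder, size=(256, 128)):
    """Return a 2D float array with one flattened grayscale image per row.

    Hypothetical helper: `size` is (width, height), as expected by PIL.
    """
    rows = []
    for root, _dirnames, filenames in os.walk(folder):
        for name in sorted(filenames):
            if not name.lower().endswith(('.jpg', '.jpeg')):
                continue  # assumption: only JPEG files are of interest
            with Image.open(os.path.join(root, name)) as img:
                # Grayscale + fixed size so every row has width * height values
                gray = img.convert('L').resize(size)
                rows.append(np.asarray(gray, dtype=np.float64).ravel())
    # np.stack raises ValueError if no images were found
    return np.stack(rows)


def fit_pca(matrix, n_components=2):
    """Fit PCA and return the fitted model plus the projected data.

    n_components must not exceed min(n_images, n_pixels).
    """
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(matrix)
    return pca, reduced
```

Usage would look something like X = load_images_as_matrix('D:/folder/train') followed by pca, reduced = fit_pca(X, n_components=50). After that, pca.explained_variance_ratio_ shows how much variance each component keeps, and reduced has one row per image with n_components columns.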