
I want to run some machine learning algorithms, such as PCA and KNN, on a relatively large dataset of images (>2000 RGB images) in order to classify them.

My source code is the following:

import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import PCA
from sklearn import neighbors
from sklearn import preprocessing


data = []

# Read images from file
for filename in glob('Images/*.jpg'):

    # cv2.imread already returns a NumPy array (or None if the file cannot be read)
    img = cv2.imread(filename)
    if img is None:
        continue
    height, width = img.shape[:2]

    # Check that all my images are of the same resolution
    if height == 529 and width == 940:

        # Flatten each image so that it is stored as a single row vector
        img = img.flatten()
        data.append(img)

# Normalise each image vector to unit norm (Normalizer scales per sample)
data = np.array(data)
Norm = preprocessing.Normalizer()
Norm.fit(data)
data = Norm.transform(data)

# PCA: keep enough components to explain 95% of the variance
pca = PCA(0.95)
pca.fit(data)
data = pca.transform(data)

# K-Nearest neighbours
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(data)
distances, indices = knn.kneighbors(data)

print(indices)

However, my laptop is not sufficient for this task, as it needs many hours to process more than 700 RGB images. So I need to use the computational resources of an online platform (e.g. the ones provided by GCP). How can I simply use some of GCP's resources (faster CPUs, GPUs, etc.) to run my source code above?

Can I simply make a call from PyCharm to the Compute Engine API (after I have created a virtual machine there) to run my Python script?

Or is the only possible solution either to install PyCharm on the virtual machine and run the Python script there, or to do what these answers suggest on the virtual machine (Running a python script on Google Cloud Compute Engine, Run python script on Google Cloud Compute Engine)?


1 Answer


First, it sounds like you'll need to move your images somewhere GCP can access them, such as Google Cloud Storage (GCS). You can't run your code on GCP without the images being there as well. You can then use Compute Engine to run your Python code, perhaps in a Docker container. You'll have to expand your code so that it can kick off the process, access GCS to fetch the images, and store the results somewhere.
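
For example, once the images are in a bucket, your script could pull them down onto the VM before the processing starts. This is only a minimal sketch using the google-cloud-storage client library; the bucket name, prefix and local directory below are placeholder assumptions, not values from your setup:

import os
from google.cloud import storage

BUCKET_NAME = 'my-image-bucket'   # hypothetical bucket holding the .jpg files
PREFIX = 'Images/'                # hypothetical folder inside the bucket
LOCAL_DIR = 'Images'

os.makedirs(LOCAL_DIR, exist_ok=True)

client = storage.Client()

# Download every .jpg under the prefix so the existing glob('Images/*.jpg') loop works unchanged
for blob in client.list_blobs(BUCKET_NAME, prefix=PREFIX):
    if blob.name.endswith('.jpg'):
        destination = os.path.join(LOCAL_DIR, os.path.basename(blob.name))
        blob.download_to_filename(destination)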

I would also look into Google Dataproc. If you're serious about using the cloud to process a lot of data, Dataproc is a managed service for doing this kind of work in a scaled way. It can pull data from GCS, run your code, spread the load across a cluster of machines, and store the results in a database like BigQuery or Cloud SQL.
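
If you go the Dataproc route, the image loading and flattening can be spread across the cluster with PySpark. This is just a rough sketch, assuming the images sit in a GCS bucket (placeholder path) and that OpenCV is installed on the worker nodes; the PCA/KNN steps would still run on the collected matrix as in your original script:

import cv2
import numpy as np
from pyspark import SparkContext

sc = SparkContext()

def decode_and_flatten(path_and_bytes):
    # Decode the raw JPEG bytes into a BGR image, then flatten it to one row
    path, raw = path_and_bytes
    img = cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)
    if img is None or img.shape[:2] != (529, 940):
        return []
    return [img.flatten()]

# binaryFiles distributes the reading across the workers (gs:// paths work on Dataproc)
rows = sc.binaryFiles('gs://my-image-bucket/Images/*.jpg').flatMap(decode_and_flatten)
data = np.array(rows.collect())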