For a particular task I'm working on, I have a dataset that is about 25 GB. I'm still experimenting with several preprocessing methods and definitely don't have the data in its final form yet. I'm not sure what the common workflow is for this sort of problem, so here is what I'm thinking (a rough sketch follows the list):
- Copy the dataset from the Cloud Storage bucket to the Compute Engine machine's SSD (maybe around 50 GB) using gcsfuse.
- Apply various preprocessing operations as an experiment.
- Run training with PyTorch on the data stored on the local disk (SSD).
- Copy the newly processed data back to the storage bucket with gcsfuse if the preprocessing was successful.
- Upload results and delete the persistent disk that was used during training.
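To make this concrete, here is a rough sketch of what I have in mind for the first workflow. The mount points (`/mnt/gcs` for the gcsfuse mount, `/mnt/ssd` for the local SSD), the `*.pt` file layout, and the normalization step are all just placeholders for whatever preprocessing I end up doing:

```python
import shutil
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical paths: bucket mounted with gcsfuse at /mnt/gcs, attached SSD at /mnt/ssd.
GCS_MOUNT = Path("/mnt/gcs/raw")
LOCAL_RAW = Path("/mnt/ssd/raw")
LOCAL_PROCESSED = Path("/mnt/ssd/processed")


def stage_to_ssd() -> None:
    """Copy the raw dataset from the gcsfuse mount onto the local SSD."""
    shutil.copytree(GCS_MOUNT, LOCAL_RAW, dirs_exist_ok=True)


def preprocess() -> None:
    """Placeholder preprocessing: load each raw tensor, normalize it, save it locally."""
    LOCAL_PROCESSED.mkdir(parents=True, exist_ok=True)
    for f in LOCAL_RAW.glob("*.pt"):
        x = torch.load(f)
        x = (x - x.mean()) / (x.std() + 1e-8)
        torch.save(x, LOCAL_PROCESSED / f.name)


class LocalTensorDataset(Dataset):
    """Reads preprocessed tensors from the local SSD during training."""

    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.pt"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.load(self.files[idx])


def sync_back_to_bucket() -> None:
    """Copy the processed data back to the bucket through the gcsfuse mount."""
    shutil.copytree(LOCAL_PROCESSED, Path("/mnt/gcs/processed"), dirs_exist_ok=True)


if __name__ == "__main__":
    stage_to_ssd()
    preprocess()
    loader = DataLoader(LocalTensorDataset(LOCAL_PROCESSED), batch_size=32, num_workers=4)
    # ... training loop over `loader` goes here ...
    sync_back_to_bucket()
```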
The alternative approach is this (again, a sketch follows the list):
- Run the preprocessing operations on the data in the Cloud Storage bucket itself, via the gcsfuse-mounted directory.
- Run training with PyTorch directly against the gcsfuse-mounted bucket directory, using a Compute Engine instance with very little local storage.
- Upload results and delete the Compute Engine instance.
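For the second workflow, the training side would look roughly like the sketch below, again with a hypothetical mount point and file layout. Every `__getitem__` call would be served through gcsfuse over the network rather than from a local disk, which is exactly the trade-off I'm unsure about:

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical mount point for the bucket via gcsfuse.
GCS_MOUNT = Path("/mnt/gcs/processed")


class GcsFuseDataset(Dataset):
    """Reads each sample straight from the gcsfuse-mounted bucket.

    Each read turns into GCS object requests, so throughput depends on
    network latency and gcsfuse caching rather than local disk speed.
    """

    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.pt"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.load(self.files[idx])


loader = DataLoader(GcsFuseDataset(GCS_MOUNT), batch_size=32, num_workers=8)
```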
Which of these approaches is recommended? Which will incur fewer charges, and which is used most often for these kinds of operations? Is there a different workflow that I'm not seeing here?