For a particular task I'm working on, I have a dataset that is about 25 GB. I'm still experimenting with several preprocessing methods and definitely don't have the data in its final form yet. I'm not sure what the common workflow is for this sort of problem, so here is what I'm thinking (a rough sketch follows the list):
- Copy the dataset from the Cloud Storage bucket to the Compute Engine machine's SSD (maybe around 50 GB) using gcsfuse.
- Apply various preprocessing operations as an experiment.
- Run training with PyTorch on the data stored on the local disk (SSD).
- Copy the newly processed data back to the storage bucket with gcsfuse if the preprocessing was successful.
- Upload results and delete the persistent disk that was used during training.
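To make this concrete, here is a rough sketch of what I have in mind for the first workflow. The mount points (`/mnt/gcs` for the gcsfuse mount, `/mnt/ssd` for the local SSD), the `*.pt` file layout, and the normalization step are all just placeholders for whatever preprocessing I end up doing:

```python
import shutil
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical paths: bucket mounted with gcsfuse at /mnt/gcs, attached SSD at /mnt/ssd.
GCS_MOUNT = Path("/mnt/gcs/raw")
LOCAL_RAW = Path("/mnt/ssd/raw")
LOCAL_PROCESSED = Path("/mnt/ssd/processed")


def stage_to_ssd() -> None:
    """Copy the raw dataset from the gcsfuse mount onto the local SSD."""
    shutil.copytree(GCS_MOUNT, LOCAL_RAW, dirs_exist_ok=True)


def preprocess() -> None:
    """Placeholder preprocessing: load each raw tensor, normalize it, save it locally."""
    LOCAL_PROCESSED.mkdir(parents=True, exist_ok=True)
    for f in LOCAL_RAW.glob("*.pt"):
        x = torch.load(f)
        x = (x - x.mean()) / (x.std() + 1e-8)
        torch.save(x, LOCAL_PROCESSED / f.name)


class LocalTensorDataset(Dataset):
    """Reads preprocessed tensors from the local SSD during training."""

    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.pt"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.load(self.files[idx])


def sync_back_to_bucket() -> None:
    """Copy the processed data back to the bucket through the gcsfuse mount."""
    shutil.copytree(LOCAL_PROCESSED, Path("/mnt/gcs/processed"), dirs_exist_ok=True)


if __name__ == "__main__":
    stage_to_ssd()
    preprocess()
    loader = DataLoader(LocalTensorDataset(LOCAL_PROCESSED), batch_size=32, num_workers=4)
    # ... training loop over `loader` goes here ...
    sync_back_to_bucket()
```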
The alternative approach is this (again, a sketch follows the list):
- Run the preprocessing operations on the data in the Cloud Storage bucket itself, via the gcsfuse-mounted directory.
- Run training with PyTorch directly against the gcsfuse-mounted bucket directory, using a Compute Engine instance with very little local storage.
- Upload results and delete the Compute Engine instance.
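For the second workflow, the training side would look roughly like the sketch below, again with a hypothetical mount point and file layout. Every `__getitem__` call would be served through gcsfuse over the network rather than from a local disk, which is exactly the trade-off I'm unsure about:

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical mount point for the bucket via gcsfuse.
GCS_MOUNT = Path("/mnt/gcs/processed")


class GcsFuseDataset(Dataset):
    """Reads each sample straight from the gcsfuse-mounted bucket.

    Each read turns into GCS object requests, so throughput depends on
    network latency and gcsfuse caching rather than local disk speed.
    """

    def __init__(self, root: Path):
        self.files = sorted(root.glob("*.pt"))

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.load(self.files[idx])


loader = DataLoader(GcsFuseDataset(GCS_MOUNT), batch_size=32, num_workers=8)
```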
Which of these approaches is recommended? Which will incur fewer charges, and which is used most often for these kinds of operations? Is there a different workflow that I'm not seeing here?