0 votes

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working; however, there are ~35k files and only ~5k of them have any data in them.

Is there any way to copy only the files above a certain size?

I edited my answer: at first I assumed the source was a bucket, but you're copying from a Compute Engine instance. I also assumed it's running Linux; if not, you can install a Linux emulator to run the du and awk commands. – Emmanuel

2 Answers

0 votes

I've not tried this but...

You should be able to do this using a resumable transfer and setting the resumable threshold to 5k (it defaults to 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers

It may be advisable to set BOTO_CONFIG specifically for this copy: (a) to be intentional; (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
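For example, a minimal sketch of what that could look like (the custom.boto file name, the 5000-byte threshold, and gs://bucket2 are placeholders to adjust, and it assumes your instance authenticates via its service account so the config needs no credentials section). First the config file:

# contents of ~/custom.boto (hypothetical file); only the threshold is overridden
[GSUtil]
resumable_threshold = 5000

Then point gsutil at it just for this copy:

BOTO_CONFIG=~/custom.boto gsutil -m cp * gs://bucket2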

Resumable uploads have the added benefit, of course, of resuming if there are any failures.

Recommend: try this on a small subset and confirm it works to your satisfaction.

0 votes

While it's not possible to do this with gsutil alone, you can do it by building the list of file names and using the -I flag of the cp command to process them. If you're using a Linux Compute Engine instance, you can do this with the du and awk commands:

du -b * | awk '{if ($1 > 1000) print $2}' | gsutil -m cp -I gs://bucket2

The command gets the size in bytes of each file in the current directory on your Compute Engine instance with du -b * and copies only the files larger than 1000 bytes to bucket2; you can change that value to suit your needs.
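Note that awk's print $2 prints only the first whitespace-separated word of each name, so file names containing spaces get truncated. If that can happen, a variant based on find avoids the field splitting (same assumed 1000-byte threshold and placeholder bucket):

find . -maxdepth 1 -type f -size +1000c | gsutil -m cp -I gs://bucket2

gsutil cp -I reads one path per line from stdin, so paths with spaces come through intact; file names containing newlines would still break.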