I am trying to copy files from a directory on my Google Compute Instance to Google Cloud Storage Bucket. I have it working, however there are ~35k files but only ~5k have an data in them.
Is there anyway to only copy files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5k (defaults to 8Mib). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
May be advisable to set BOTO_CONFIG
specifically for this copy (a) to be intentional; (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads has the added benefit, of course, of resuming if there are any failures.
Recommend: try this on a small subset and confirm it works to your satisfaction.
While it's not possible to do it only with gsutil, it's possible to do it by parsing the names and use the -I
flag on the cp
command to process them. If you're using a Linux Compute Engine instance you can perform it by using the du and awk commands:
du * | awk '{if ($1 > 1000) print $2 }' | gsutil -m cp -I gs://bucket2
The command will get the filesize of the files inside the current directory on your compute engine with du *
and will only copy the files which size are larger than 1000 bytes to bucket2
, you can change that value to adjust it to your needs.
du
andawk
commands – Emmanuel