5 votes

I have a bucket in Google Cloud Storage with a tmp folder in it. Thousands of files are created in this directory every day, and I want to delete the files that are older than 1 day every night. I could not find a gsutil argument for this job, so I had to use a classic, simple shell script. But the files are being deleted very slowly.

I have 650K files accumulated in the folder, and 540K of them must be deleted. But my shell script ran for a whole day and managed to delete only 34K files.

The gsutil lifecycle feature is not able to do exactly what I want: it cleans the whole bucket, while I just want to regularly delete the files under a certain folder. At the same time, I want the deletion to run faster.
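
For reference, the bucket-wide lifecycle rule I mean looks roughly like this (a minimal sketch; the config path is just an example). The age condition matches every object in the bucket, and I found no option to limit it to the tmp/ folder:

    # example config path; deletes every object older than 1 day, bucket-wide
    cat > /tmp/lifecycle.json <<'EOF'
    {
      "rule": [
        {
          "action": {"type": "Delete"},
          "condition": {"age": 1}
        }
      ]
    }
    EOF
    gsutil lifecycle set /tmp/lifecycle.json gs://mygooglecloudstorage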

I'm open to your suggestions and your help. Can I do this with a single gsutil command, or with a different method?

The simple script I created for testing (I put it together as a temporary way to bulk-delete the files):

    ## step 1 - list the files together with their timestamps and save the output to list1.txt.
    gsutil -m ls -la gs://mygooglecloudstorage/tmp/ | awk '{print $2,$3}' > /tmp/gsutil-tmp-files/list1.txt


    ## step 2 - filter list1.txt: strip the time and the #generation suffix, drop today's files, and save the remaining old-dated files to list2.txt.
    cat /tmp/gsutil-tmp-files/list1.txt | awk -F "T" '{print $1,$2,$3}' | awk '{print $1,$3}' | awk -F "#" '{print $1}' | grep -v `date +%F` | sort -bnr > /tmp/gsutil-tmp-files/list2.txt


    ## step 3 - replace the date column with the gsutil delete command, turning the list into a shell script.
    cat /tmp/gsutil-tmp-files/list2.txt | awk '{$1 = "/root/google-cloud-sdk/bin/gsutil -m rm -r "; print}' > /tmp/gsutil-tmp-files/remove-old-files.sh


    ## step 4 - set the script permissions and delete the old lists.
    chmod 755 /tmp/gsutil-tmp-files/remove-old-files.sh
    rm -rf /tmp/gsutil-tmp-files/list1.txt /tmp/gsutil-tmp-files/list2.txt


    ## step 5 - run the shell script and destroy it when it is done.
    /bin/sh /tmp/gsutil-tmp-files/remove-old-files.sh
    rm -rf /tmp/gsutil-tmp-files/remove-old-files.sh
2 votes

I solved this problem by mounting my bucket with the gcsfuse tool; now I can manage my bucket like a local disk. Many operations on the mounted disk are still slow, but for now I can clear the folder quickly. For detailed information about gcsfuse, see cloud.google.com/storage/docs/gcs-fuse. I still think Google needs an automatic solution for these needs. Those who have a similar problem can use this method, and I am open to new ideas from anyone who has a better approach.
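
For the record, the approach looks roughly like this (a minimal sketch; the mount point is a placeholder, and find -mtime +0 matches files last modified more than 24 hours ago):

    # mount the bucket as a local filesystem (mount point is an example)
    mkdir -p /mnt/mybucket
    gcsfuse mygooglecloudstorage /mnt/mybucket

    # delete everything under tmp/ that is older than 1 day, then unmount
    find /mnt/mybucket/tmp -type f -mtime +0 -delete
    fusermount -u /mnt/mybucket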

2 Answers

10 votes

There is a very simple way to do this, for example:

    gsutil -m ls -l gs://bucket-name/ | grep 2017-06-23 | grep .jpg | awk '{print $3}' | gsutil -m rm -I
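
To run this nightly without hard-coding a date, a variant like the following should work (a sketch using the bucket and folder names from the question; grep -v drops everything created today, and grep "gs://" filters out the TOTAL summary line that gsutil ls -l appends):

    # delete everything under tmp/ that was not created today (date computed at run time)
    gsutil -m ls -l gs://mygooglecloudstorage/tmp/ | grep -v "$(date +%F)" | grep "gs://" | awk '{print $3}' | gsutil -m rm -I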
3 votes

There isn't a simple way to do this with gsutil or object lifecycle management as of today.

That being said, would it be feasible for you to change the naming format for the objects in your bucket? That is, instead of uploading them all under "gs://mybucket/tmp/", you could append the current date to that prefix, resulting in something like "gs://mybucket/tmp/2017-12-27/". The main advantages of this would be:

  • Not having to do a date comparison for every object; you could run gsutil ls "gs://mybucket/tmp/" | grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" to find those prefixes, then do date comparisons on the last portion of those paths (see the sketch after this list).
  • Being able to supply a smaller number of arguments on the command line (prefixes, rather than the name of each individual file) to gsutil -m rm -r, thus being less likely to pass in more arguments than your shell can handle.
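
A minimal bash sketch of that nightly cleanup, assuming the date-prefixed layout above (the cutoff uses GNU date, and prefixes strictly older than yesterday are removed):

    # find the date-named prefixes under tmp/ and remove the old ones
    CUTOFF=$(date -d "yesterday" +%F)
    gsutil ls "gs://mybucket/tmp/" |
    grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" |
    while read -r PREFIX; do
        DAY=$(basename "$PREFIX")
        # plain string comparison works because the dates are zero-padded YYYY-MM-DD
        if [[ "$DAY" < "$CUTOFF" ]]; then
            gsutil -m rm -r "$PREFIX"
        fi
    done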