0 votes

I'm on a DRA (Durable Reduced Availability) bucket and I run the gsutil rsync command quite often to upload/download files to/from the bucket.

Since files could be unavailable (because of DRA), what exactly will happen during a gsutil rsync session when such a scenario is hit?

  1. Will gsutil just wait until the unavailable files become available and complete the task, thus always downloading everything from the bucket?
  2. Or will gsutil exit with a warning about a certain file not being available, and if so, exactly what output does it produce (so that I can write a script to look for this type of message)?
  3. What will the return code of the gsutil command be in a session where files are found to be unavailable?

I need to be 100% sure that I download everything from the bucket, which I'm guessing can be difficult to keep track of when downloading hundreds of gigabytes of data. In case gsutil rsync completes without downloading unavailable files, is it possible to construct a command which retries the unavailable files until all such files have been successfully downloaded?


2 Answers

2 votes
  1. If your files exceed the resumable threshold (as of 4.7, this is 8MB), any availability issues will be retried with exponential backoff according to the num_retries and max_retry_delay configuration variables. If the file is smaller than the threshold, it will not be retried (this will be improved in 4.8 so small files also get retries).
  2. If any file(s) fail to transfer successfully, gsutil will halt and output an exception depending on the failure encountered. If you are using gsutil -m rsync or gsutil rsync -C, gsutil will continue on errors, and at the end you'll get a CommandException with the message 'N file(s)/object(s) could not be copied/removed' (see the sketch after this list).
  3. If retries are exhausted and/or either of the failure conditions described in #2 occurs, the exit code will be nonzero.
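If you want to script around that CommandException message, something along these lines should work (the bucket name and local path are placeholders, and this assumes the message is written to stderr):

# Run the sync, continuing past per-file errors (-C) with parallel transfers (-m),
# and capture stderr so the summary message can be inspected afterwards.
if ! gsutil -m rsync -r -C gs://my-bucket /local/dir 2> rsync_errors.log; then
    # Look for the 'N file(s)/object(s) could not be copied/removed' summary.
    grep "could not be copied" rsync_errors.log
fi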

In order to ensure that you download all files from the bucket, you can simply rerun gsutil rsync until you get a zero (success) exit code.
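For example, a minimal retry wrapper might look like this (bucket and local directory are placeholders; adjust the flags to your own setup):

#!/bin/bash
SRC=gs://my-bucket      # placeholder bucket
DST=/path/to/local_dir  # placeholder local directory

# Rerun gsutil rsync until it exits with status 0, i.e. until every
# object was copied successfully; wait a bit between attempts.
until gsutil -m rsync -r -C "$SRC" "$DST"; do
    echo "rsync reported failures; retrying in 60 seconds..." >&2
    sleep 60
done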

Note that gsutil rsync relies on listing objects, and listing in Google Cloud Storage is eventually consistent. So if you upload files to the bucket and then immediately run gsutil rsync, it is possible that it will miss the newly uploaded files, but the next run of gsutil rsync should pick them up.

1 vote

I did some tests on a project and could not get gsutil to throw any errors. As far as I know, gsutil operates at the directory level; it is not looking for a specific file.

When you run, for example, $ gsutil rsync local_dir gs://bucket, gsutil is not expecting any particular file; it just takes whatever you have in local_dir and uploads it to gs://bucket, so:

  1. gsutil will not wait; it will just complete.

  2. You will not get any errors. The only errors I got were when the local directory or the bucket was missing entirely.

  3. If, let's say, a file is missing in local_dir but it is available in the bucket, and you then run $ gsutil rsync -r local_dir gs://bucket, nothing will change in the bucket. With the -d option, the file will be deleted on the bucket side, as shown below.
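For example, with placeholder names for the directory and bucket:

# Without -d, the object that only exists in the bucket is left alone:
gsutil rsync -r local_dir gs://bucket

# With -d, the bucket is made to mirror local_dir, so that object is deleted:
gsutil rsync -r -d local_dir gs://bucket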

As a suggestion, you could just add a crontab entry to rerun the gsutil command a couple of times a day or at night.
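For example, a crontab entry like the following (the paths are placeholders) would rerun the sync at 02:00 and 14:00 every day:

# m h dom mon dow  command
0 2,14 * * * /usr/bin/gsutil rsync -r /home/user gs://bucket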

Another way is to create a simple script and add it to your crontab to run every hour or so. This will check whether your file exists, and if it does not, it will run the gsutil command:

#!/bin/bash
# File to check for; adjust to whatever file you care about.
FILE=/home/user/test.txt

if [ -f "$FILE" ]; then
   echo "file exists..or something"
else
   # The file is missing, so run the sync.
   gsutil rsync /home/user gs://bucket
fi
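To run that script every hour, a crontab entry along these lines would do (the script path is hypothetical):

# Run the check script at the top of every hour.
0 * * * * /home/user/sync_check.sh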

UPDATE:

I think this may be what you need. In ~/ you should have a .boto file.

~$ more .boto | grep retr
# num_retries = <integer value>
# max_retry_delay = <integer value>

Uncomment those lines and add your own numbers. The default is 6 retries, so you could do something like 24 retries and put 3600 seconds in between. In theory, this should keep it retrying for a long time.
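With the values suggested above, the relevant part of the .boto file would then look something like this (assuming the settings sit in the [Boto] section, as in the default config file):

[Boto]
# Retry each failed request up to 24 times...
num_retries = 24
# ...waiting at most 3600 seconds between attempts.
max_retry_delay = 3600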

Hope this helps!