I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fixed location for each kind where the most recent backup_info file can be found, e.g.

gs://MY-PROJECT.appspot.com/latest/Comment.backup_info

Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron job to perform backups automatically and regularly, as described in the Scheduled Backups docs, and I can also write a bit of Python code that executes backup tasks when triggered manually, as described in this SO answer. I plan to write a small Python cron job that finds the most recent backup_info file of a given kind and copies/renames it to the desired location.
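
For illustration, a minimal sketch of what that cron job could look like, using the GCS client library (cloudstorage) for App Engine Python; the bucket name, kind and target path are placeholders, and copy2() is assumed to be available in the installed version of the library:

import cloudstorage

BUCKET = '/MY-PROJECT.appspot.com'  # placeholder bucket

def copy_latest_backup_info(kind):
    """Find the newest *.<kind>.backup_info file and copy it to /latest/."""
    suffix = '.%s.backup_info' % kind
    latest = None
    # listbucket() only supports prefix filtering, so the top-level files
    # have to be paged through and filtered by suffix in code.
    for stat in cloudstorage.listbucket(BUCKET + '/', delimiter='/'):
        if stat.filename.endswith(suffix):
            if latest is None or stat.st_ctime > latest.st_ctime:
                latest = stat
    if latest is not None:
        cloudstorage.copy2(latest.filename,
                           '%s/latest/%s.backup_info' % (BUCKET, kind))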

Either way, the original backup location will become crowded with lots of files and folders over the course of a day, especially if there is more than one backup for a certain kind. For example, in gs://MY-PROJECT.appspot.com/ I will find:

VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info

The string seems to be a unique identifier for every backup execution. I assume each of these files contains a list of *.backup_info files, one for each kind in the backup.

VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info

There is one such file for every kind in the backup, e.g. "Comment". It seems to contain a list of the actual backup data files for this kind and this backup run.

datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/

There is a data folder for each backup and kind. Here, kind "Comment" was backed up three times on 9/20.

My questions are related to Datastore and/or Storage:

  1. Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
  2. If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
  3. If (1) and (2) are not possible: Which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.

1 Answer

I have found two solutions to the problem, one in GA and one in Beta.

The answers in short:

  1. The GA Datastore Export & Import service allows custom and predictable paths to the backup files.

  2. Its API for long-running operations makes it possible to retrieve the output URL of a backup job (e.g. for paths with timestamps).

  3. A Cloud Function triggered by Cloud Storage events makes it possible to handle only specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.

Datastore Export & Import

This new service has an API to run export jobs (manually or scheduled). A job lets you specify the output path and produces predictable full paths, so existing backup files can be overwritten if only the latest backup is needed at any time, e.g.:

gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata

For cron jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). The format of the export also differs from the old backups, but it can be imported into BigQuery as well, just like the old [KIND].backup_info files.
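
A minimal sketch of such a handler on App Engine standard (Python 2, webapp2), following the pattern from the scheduled-export docs; the bucket, the kind and the API version (v1 vs. the earlier v1beta1) are assumptions that may need adjusting:

import json
import webapp2
from google.appengine.api import app_identity, urlfetch

class Export(webapp2.RequestHandler):
    def get(self):
        project = app_identity.get_application_id()
        token, _ = app_identity.get_access_token(
            'https://www.googleapis.com/auth/datastore')
        body = {
            # Fixed prefix, so the latest export always lands in the same place.
            'outputUrlPrefix': 'gs://%s.appspot.com/latest' % project,
            'entityFilter': {'kinds': ['Comment']},
        }
        result = urlfetch.fetch(
            url='https://datastore.googleapis.com/v1/projects/%s:export' % project,
            payload=json.dumps(body),
            method=urlfetch.POST,
            headers={'Content-Type': 'application/json',
                     'Authorization': 'Bearer ' + token})
        # The response is a long-running operation; polling
        # GET .../v1/<operation name> returns the final outputUrl when done.
        self.response.write(result.content)

app = webapp2.WSGIApplication([('/cloud-datastore-export', Export)])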

Cloud Function

Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket. If the changed file exists (file.resourceState !== 'not_exists'), is new (file.metageneration === '1') and is in fact one of the [KIND].backup_info files we want, it gets copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite a more recent backup file with an older one). Copying or moving the actual backup payload would break the references inside the [KINDNAME].backup_info files, though.
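
The function described above is Node.js; purely as an illustration, a roughly equivalent sketch in Python (Cloud Functions also offers Python runtimes), with the destination bucket name and the kind as placeholders:

from google.cloud import storage

LATEST_BUCKET = 'latest_backups'  # placeholder destination bucket
client = storage.Client()

def copy_latest_backup_info(event, context):
    """Triggered by google.storage.object.finalize on the backup bucket."""
    name = event['name']
    if event.get('metageneration') != '1':
        return  # metadata update of an existing object, not a new upload
    if not name.endswith('.Comment.backup_info'):
        return  # not one of the [KIND].backup_info files we care about
    src_bucket = client.bucket(event['bucket'])
    dst_bucket = client.bucket(LATEST_BUCKET)
    existing = dst_bucket.get_blob('Comment.backup_info')
    # Compare creation times (RFC 3339 strings from GCS compare lexicographically)
    # so an older file never overwrites a more recent copy.
    if existing and existing.metadata and \
            existing.metadata.get('timeCreated', '') > event['timeCreated']:
        return
    copy = src_bucket.copy_blob(src_bucket.blob(name), dst_bucket,
                                'Comment.backup_info')
    copy.metadata = {'timeCreated': event['timeCreated']}
    copy.patch()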