3 votes

Our requirement is to programmatically back up Google Datastore and load these backups into Google BigQuery for further analysis. We were successful in automating the backups using the following approach:

        Queue queue = QueueFactory.getQueue("datastoreBackupQueue");

        /*
         * Create a task which is equivalent to the backup URL mentioned in
         * above cron.xml, using new queue which has Datastore admin enabled
         */
        TaskOptions taskOptions = TaskOptions.Builder.withUrl("/_ah/datastore_admin/backup.create")
                .method(TaskOptions.Method.GET).param("name", "").param("filesystem", "gs")
                .param("gs_bucket_name",
                        "db-backup" + "/" + TimeUtils.parseDateToString(new Date(), "yyyy/MMM/dd"))
                .param("queue", queue.getQueueName());

        /*
         * Get list of dynamic entity kind names from the datastore based on
         * the kinds present in the datastore at the start of backup
         */
        List<String> entityNames = getEntityNamesForBackup();
        for (String entityName : entityNames) {
            taskOptions.param("kind", entityName);
        }

        /* Add this task to above queue */
        queue.add(taskOptions);
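
For context, this enqueueing code normally runs inside a handler that the URL in cron.xml points at. Below is a minimal sketch of such a handler, assuming a servlet mapped to that cron URL; the class name and the enqueueDatastoreBackupTask helper are hypothetical stand-ins for the code above.

    /*
     * Hypothetical servlet hit by the cron entry; each scheduled run
     * enqueues the Datastore backup task built above.
     */
    public class StartBackupServlet extends javax.servlet.http.HttpServlet {
        @Override
        protected void doGet(javax.servlet.http.HttpServletRequest req,
                javax.servlet.http.HttpServletResponse resp) throws java.io.IOException {
            enqueueDatastoreBackupTask(); // wraps the queue/task code shown above
            resp.setStatus(javax.servlet.http.HttpServletResponse.SC_OK);
        }
    }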

I was then able to import these backups into Google BigQuery manually, but how do we automate this process?

I have also looked through most of the docs, and nothing helped: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#loading_data_from_google_cloud_storage


3 Answers

2 votes

I have solved this myself. Here is the solution in Java: the following code picks up the backup files from Google Cloud Storage and loads them into Google BigQuery.

        AppIdentityCredential bqCredential = new AppIdentityCredential(
                Collections.singleton(BigqueryScopes.BIGQUERY));

        AppIdentityCredential dsCredential = new AppIdentityCredential(
                Collections.singleton(StorageScopes.CLOUD_PLATFORM));

        Storage storage = new Storage(HTTP_TRANSPORT, JSON_FACTORY, dsCredential);
        Objects list = storage.objects().list(bucket).setPrefix(prefix).setFields("items/name").execute();

        if (list == null) {
            Log.severe(BackupDBController.class, "BackupToBigQueryController",
                    "List from Google Cloud Storage was null", null);
        } else if (list.getItems() == null || list.getItems().isEmpty()) {
            Log.severe(BackupDBController.class, "BackupToBigQueryController",
                    "List from Google Cloud Storage was empty", null);
        } else {

            for (String kind : getEntityNamesForBackup()) {
                Job job = new Job();
                JobConfiguration config = new JobConfiguration();
                JobConfigurationLoad loadConfig = new JobConfigurationLoad();

                String url = "";
                for (StorageObject obj : list.getItems()) {
                    String currentUrl = obj.getName();
                    if (currentUrl.contains(kind + ".backup_info")) {
                        url = currentUrl;
                        break;
                    }
                }

                if (StringUtils.isStringEmpty(url)) {
                    continue;
                } else {
                    url = "gs://"+bucket+"/" + url;
                }

                List<String> gsUrls = new ArrayList<>();
                gsUrls.add(url);

                loadConfig.setSourceUris(gsUrls);
                loadConfig.set("sourceFormat", "DATASTORE_BACKUP");
                loadConfig.set("allowQuotedNewlines", true);

                TableReference table = new TableReference();
                table.setProjectId(projectId);
                table.setDatasetId(datasetId);
                table.setTableId(kind);
                loadConfig.setDestinationTable(table);

                config.setLoad(loadConfig);
                job.setConfiguration(config);

                Bigquery bigquery = new Bigquery.Builder(HTTP_TRANSPORT, JSON_FACTORY, bqCredential)
                        .setApplicationName("BigQuery-Service-Accounts/0.1").setHttpRequestInitializer(bqCredential)
                        .build();
                Insert insert = bigquery.jobs().insert(projectId, job);

                /* jobs().insert() submits an asynchronous load job */
                JobReference jr = insert.execute().getJobReference();
                Log.info(BackupDBController.class, "BackupToBigQueryController",
                        "Submitted BigQuery load job " + jr.getJobId() + " for kind " + kind, null);
            }
        }
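
One caveat: jobs().insert() only submits an asynchronous load job, so the code above does not see whether the load actually finished. Below is a minimal polling sketch, assuming the same bigquery client and the JobReference returned above; the helper name is illustrative, not part of the original answer.

    /*
     * Illustrative helper: polls the submitted load job until BigQuery
     * reports it as DONE, then checks for an error result.
     */
    private static void waitForJob(Bigquery bigquery, String projectId, JobReference jr)
            throws IOException, InterruptedException {
        while (true) {
            Job polled = bigquery.jobs().get(projectId, jr.getJobId()).execute();
            if ("DONE".equals(polled.getStatus().getState())) {
                if (polled.getStatus().getErrorResult() != null) {
                    throw new IOException("Load job failed: "
                            + polled.getStatus().getErrorResult().getMessage());
                }
                return;
            }
            Thread.sleep(5000); // wait before polling again
        }
    }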

If anyone has a better approach, please let me know.

0 votes

The loading data from Google Cloud Storage article that you mentioned in your question describes programmatic examples of importing from GCS using the command line, Node.js, or Python.

You can also automate importing data located on Cloud Storage into BigQuery by running the following command in your script:

$ gcloud alpha bigquery import SOURCE DESTINATION_TABLE

For more information on this command, visit this article.
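
If the rest of your automation is in Java, like the first answer, one way to run that command is to shell out to it from your job. Below is a minimal sketch, assuming gcloud is installed and authenticated on the machine running the code; the method and parameter names are placeholders.

    /*
     * Illustrative only: invokes the gcloud command shown above from Java.
     * source is a gs:// URI and destinationTable is in dataset.table form.
     */
    public static int runGcloudImport(String source, String destinationTable)
            throws java.io.IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "gcloud", "alpha", "bigquery", "import", source, destinationTable);
        pb.inheritIO(); // forward gcloud output to this process's stdout/stderr
        return pb.start().waitFor(); // exit code 0 means the import command succeeded
    }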

0 votes

As of last week there's a proper way to automate this. The most important part is gcloud beta datastore export.

I created a short script around that: https://github.com/chees/datastore2bigquery

You can adjust that to fit your situation.