1 vote

Currently I'm using Google's two-step method: back up the Datastore and then import the backup into BigQuery. I also reviewed the code that uses the pipeline approach. Both methods are inefficient and costly because all of the data is imported every time; I only need to add the records created since the last import.

What is the right way to do this? Is there a working example of how to do it in Python?


2 Answers

3 votes

You can look at Streaming inserts. I'm actually looking at doing the same thing in Java at the moment.
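
As a rough illustration (not part of the original answer), a streaming insert into BigQuery from Python with the google-api-python-client library could look something like this; the project, dataset, and table names are placeholders you would replace with your own:

    from googleapiclient import discovery
    from oauth2client.client import GoogleCredentials

    # Build an authenticated BigQuery client using application default credentials.
    credentials = GoogleCredentials.get_application_default()
    bigquery = discovery.build('bigquery', 'v2', credentials=credentials)

    # Each row carries an insertId so that retries do not create duplicate rows.
    body = {
        'rows': [
            {'insertId': 'entity-key-123', 'json': {'name': 'example', 'value': 42}},
        ]
    }

    response = bigquery.tabledata().insertAll(
        projectId='my-project',    # placeholder
        datasetId='my_dataset',    # placeholder
        tableId='my_table',        # placeholder
        body=body).execute()

    if response.get('insertErrors'):
        raise RuntimeError('Some rows were rejected: %s' % response['insertErrors'])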

If you want to do it every hour, you could maybe add your inserts to a pull queue (either as serialised entities or keys/IDs) each time you put a new entity to Datastore. You could then process the queue hourly with a cron job.
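
Since the question asks for Python, here is a hedged sketch of that pull-queue idea on the App Engine Python runtime; the queue name 'bigquery-export' and the entity_to_row() helper are invented for illustration and would need to exist in your project (the queue in queue.yaml, the helper in your code):

    from google.appengine.api import taskqueue
    from google.appengine.ext import ndb

    QUEUE_NAME = 'bigquery-export'  # assumed pull queue, defined in queue.yaml

    def record_for_export(entity):
        """Call this right after put(): remember the key of the changed entity."""
        taskqueue.Queue(QUEUE_NAME).add(
            taskqueue.Task(payload=entity.key.urlsafe(), method='PULL'))

    def hourly_export():
        """Cron handler: lease the queued keys and stream the entities to BigQuery."""
        queue = taskqueue.Queue(QUEUE_NAME)
        tasks = queue.lease_tasks(lease_seconds=300, max_tasks=500)
        if not tasks:
            return
        keys = [ndb.Key(urlsafe=t.payload) for t in tasks]
        entities = ndb.get_multi(keys)
        # entity_to_row() is a hypothetical helper that maps an entity to a BigQuery row.
        rows = [{'json': entity_to_row(e)} for e in entities if e]
        # ... stream `rows` with bigquery.tabledata().insertAll(...) as sketched above ...
        queue.delete_tasks(tasks)  # delete only after the insert succeeded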

2 votes

There is no full working example (as far as I know), but I believe the following process could help you:

1- You'd need to add a "last time changed" timestamp property to your entities and keep it updated on every write.

2- Every hour you can run a MapReduce job whose mapper filters on that timestamp and only picks up the entities that were updated in the last hour (see the sketch after this list).

3- Manually add what needs to be added to your backup.
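
A minimal sketch of steps 1 and 2, using a plain ndb query in place of the full MapReduce mapper (the Record kind and its properties are invented for illustration):

    from datetime import datetime, timedelta
    from google.appengine.ext import ndb

    class Record(ndb.Model):
        # Step 1: auto_now refreshes this timestamp on every put().
        name = ndb.StringProperty()
        updated = ndb.DateTimeProperty(auto_now=True)

    def entities_changed_since(hours=1):
        """Step 2, simplified: fetch only entities touched in the last `hours` hours."""
        cutoff = datetime.utcnow() - timedelta(hours=hours)
        return Record.query(Record.updated > cutoff).fetch()

    # Step 3: convert these entities to rows and append them to the backup,
    # for example by streaming them to BigQuery as shown in the other answer.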

As I said, this is pretty high level, and a full answer would require quite a bit of code; honestly, I don't think it is well suited to Stack Overflow's format.