
I want to load a lot of data from Google Datastore.

So, Step 1: I run a keys-only query (keysOnly=true) and page through the results, collecting a cursor that points to the start of each page of 600 objects. I store the cursors in a local variable.

Step 2: I spin off one thread per cursor, loading and processing 600 objects in each thread.
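
For concreteness, here is a minimal sketch of the two steps, assuming the App Engine low-level Java Datastore API; the kind `Item`, the page size, and `process()` are placeholders:

```java
import com.google.appengine.api.ThreadManager;
import com.google.appengine.api.datastore.*;

import java.util.ArrayList;
import java.util.List;

public class ParallelLoader {
    private static final int PAGE_SIZE = 600;

    public void loadAll() throws InterruptedException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Step 1: keys-only query; record the cursor at each page boundary.
        List<Cursor> pageStarts = new ArrayList<>();
        pageStarts.add(null); // the first page has no start cursor
        PreparedQuery keysOnly = ds.prepare(new Query("Item").setKeysOnly());
        Cursor cursor = null;
        while (true) {
            FetchOptions opts = FetchOptions.Builder.withLimit(PAGE_SIZE);
            if (cursor != null) {
                opts = opts.startCursor(cursor);
            }
            QueryResultList<Entity> page = keysOnly.asQueryResultList(opts);
            if (page.size() < PAGE_SIZE) {
                break; // last page reached
            }
            cursor = page.getCursor();
            pageStarts.add(cursor);
        }

        // Step 2: one thread per cursor, each re-running the query from its
        // page boundary. Same kind and filters, but not keys-only this time.
        List<Thread> threads = new ArrayList<>();
        for (final Cursor start : pageStarts) {
            Thread t = ThreadManager.createThreadForCurrentRequest(new Runnable() {
                public void run() {
                    DatastoreService tds = DatastoreServiceFactory.getDatastoreService();
                    FetchOptions opts = FetchOptions.Builder.withLimit(PAGE_SIZE);
                    if (start != null) {
                        opts = opts.startCursor(start);
                    }
                    QueryResultList<Entity> page =
                        tds.prepare(new Query("Item")).asQueryResultList(opts);
                    for (Entity e : page) {
                        process(e);
                    }
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    private void process(Entity e) {
        // application-specific processing
    }
}
```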

This is not the usual way cursors are used.

However, it looks correct to me: the actual query strings in Step 1 and Step 2 are identical. This resembles the usual stateless web use case, where a user may click Next, then Back, then reload a previous page; a cursor need not come directly from the result of the immediately preceding cursor query.

I don't want to step through the cursors sequentially and only parallelize the processing of objects returned by each cursor query, because I want to parallelize the IO-intensive querying itself.

I am getting inconsistent results, which seem to involve missed pages and duplicated objects. Is this the correct way to multithread the loading of large amounts of data from Google Datastore? Or if not, what is?

did you solve that? I have a similar problem. - aydunno
@LocationsCloudToHub One approach is to load each entity from Datastore sequentially, but then spin off the initialisation processing for each one in a new thread, or use a queue. See @Andrei Volgin's answer. - Joshua Fox

3 Answers

3 votes

I would recommend a different approach. Run only one query that cycles through all of your entities. It runs very fast (don't forget to set the batch size to 500; the default is only 10). You may still need cursors if the result set is huge.

For every entity, create a task using the Task Queue API and add it to a queue. These tasks execute in parallel, and you can set all the throttling parameters on the queue.

With this approach you don't have to worry about threads, tasks automatically retry when they fail, etc. I find this a very important part of App Engine's appeal: write only your own logic, and let App Engine worry about execution.
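
As a hedged sketch of this approach with the App Engine Task Queue API (the queue name, handler URL, and kind are placeholders):

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class FanOut {
    public void enqueueAll() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Queue queue = QueueFactory.getQueue("process-entities"); // rate, concurrency, retries set in queue.xml

        // One fast keys-only pass over all entities; chunk size 500 instead of the small default.
        Query q = new Query("Item").setKeysOnly(); // "Item" is a placeholder kind
        for (Entity e : ds.prepare(q)
                .asIterable(FetchOptions.Builder.withChunkSize(500))) {
            // One task per entity; tasks run in parallel and retry on failure.
            queue.add(TaskOptions.Builder
                .withUrl("/tasks/process") // hypothetical handler that loads and processes one entity
                .param("key", KeyFactory.keyToString(e.getKey())));
        }
    }
}
```

In practice you might batch several keys per task to reduce task-creation overhead.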

1 vote

Depending on what you're doing, you have a few options:

  1. If you have a large amount of data, use a 'fan-out' model with task queues. In this model, task-queue jobs load a segment of data, process it, store a result, and possibly trigger more processing jobs. Task queue throttling lets you control throughput, duration, and cost, and also handles failure and retry. An advantage of this model is that you can test and rerun segments by poking URLs manually, and view progress in the admin panel.

  2. Use GAE MapReduce - https://cloud.google.com/appengine/docs/java/dataprocessing/

  3. In a single process, if you have a small amount of data. The drawback is the request deadline (60s, 10m, or 24hrs, depending on the type of server and request). Recall that Datastore operations can be asynchronous, so you can run requests in parallel from a single thread (see the sketch after this list), which may simplify your code. How many run before they become blocking is (I believe) controlled by max-concurrent-requests in your appengine-web.xml or app.yaml. This can be very expensive if your request can fail and isn't built to be resumable.
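
For option 3, here is a hedged sketch of the async pattern using the low-level AsyncDatastoreService (names are placeholders; option 1 would look much like the task-queue sketch in the previous answer):

```java
import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class AsyncLoad {
    public List<Entity> loadInParallel(List<Key> keys)
            throws ExecutionException, InterruptedException {
        AsyncDatastoreService ds = DatastoreServiceFactory.getAsyncDatastoreService();

        // Issue all gets up front; each call returns immediately with a Future.
        List<Future<Entity>> futures = new ArrayList<>();
        for (Key key : keys) {
            futures.add(ds.get(key));
        }

        // The RPCs run in parallel; we block only when collecting results.
        List<Entity> entities = new ArrayList<>();
        for (Future<Entity> f : futures) {
            entities.add(f.get());
        }
        return entities;
    }
}
```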

0 votes

Ed Davisson, a Google engineer who works on the Google Datastore Client API, answered this, giving the root cause of the problem and a recommended solution.

He says:

"The cursors returned by a query are only valid for use in the same query. When you switch from the keys-only query [In my Step 1, JF] to the non-keys-only query [In my Step 2, JF], the cursors are no longer applicable....

"If your goal is to split a result set into similar sized chunks, you might want to take a look at QuerySplitter [which is now in version 1beta3, JF]."