
I want to load a lot of data from Google Datastore.

So, Step 1: I run a keys-only query (keysOnly=true) and page through the results, collecting a cursor that points to the start of each page of 600 objects. I store the cursors in a local variable.

Step 2: I spin off one thread per cursor, loading and processing 600 objects in each thread.
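
For concreteness, here is a minimal sketch of the two steps, assuming the App Engine low-level Java Datastore API; the kind `Item`, the page size, and `process()` are placeholders:

```java
import com.google.appengine.api.ThreadManager;
import com.google.appengine.api.datastore.*;

import java.util.ArrayList;
import java.util.List;

public class ParallelLoader {
    private static final int PAGE_SIZE = 600;

    public void loadAll() throws InterruptedException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Step 1: keys-only query; record the cursor at each page boundary.
        List<Cursor> pageStarts = new ArrayList<>();
        pageStarts.add(null); // the first page has no start cursor
        PreparedQuery keysOnly = ds.prepare(new Query("Item").setKeysOnly());
        Cursor cursor = null;
        while (true) {
            FetchOptions opts = FetchOptions.Builder.withLimit(PAGE_SIZE);
            if (cursor != null) {
                opts = opts.startCursor(cursor);
            }
            QueryResultList<Entity> page = keysOnly.asQueryResultList(opts);
            if (page.size() < PAGE_SIZE) {
                break; // last page reached
            }
            cursor = page.getCursor();
            pageStarts.add(cursor);
        }

        // Step 2: one thread per cursor, each re-running the query from its
        // page boundary. Same kind and filters, but not keys-only this time.
        List<Thread> threads = new ArrayList<>();
        for (final Cursor start : pageStarts) {
            Thread t = ThreadManager.createThreadForCurrentRequest(new Runnable() {
                public void run() {
                    DatastoreService tds = DatastoreServiceFactory.getDatastoreService();
                    FetchOptions opts = FetchOptions.Builder.withLimit(PAGE_SIZE);
                    if (start != null) {
                        opts = opts.startCursor(start);
                    }
                    QueryResultList<Entity> page =
                        tds.prepare(new Query("Item")).asQueryResultList(opts);
                    for (Entity e : page) {
                        process(e);
                    }
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    private void process(Entity e) {
        // application-specific processing
    }
}
```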

This is not the usual way cursors are used.

However, it looks correct to me: the actual query strings in Step 1 and Step 2 are identical. This resembles the usual stateless web use case, where a user may click Next, then Back, then reload a previous page; a cursor need not come directly from the result of the immediately preceding cursor query.

I don't want to step through the cursors sequentially and only parallelize the processing of objects returned by each cursor query, because I want to parallelize the IO-intensive querying itself.

I am getting inconsistent results, which seem to involve missed pages and duplicated objects. Is this the correct way to multithread the loading of large amounts of data from Google Datastore? Or if not, what is?

did you solve that? I have a similar problem. - aydunno
@LocationsCloudToHub One approach is to load each entity from Datastore sequentially, but then spin off the initialisation processing for each one in a new thread, or use a queue. See @Andrei Volgin's answer. - Joshua Fox

3 Answers

3 votes

I would recommend a different approach. Run only one query that cycles through all of your entities. It runs very fast (don't forget to set the batch size to 500; the default is only 10). You may still need cursors if the result set is huge.

For every entity, create a task using the Task Queue API and add it to a queue. These tasks execute in parallel, and you can set all the throttling parameters on the queue.

With this approach you don't have to worry about threads, tasks automatically retry when they fail, etc. I find this a very important part of App Engine's appeal: write only your own logic, and let App Engine worry about execution.
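
As a hedged sketch of this approach with the App Engine Task Queue API (the queue name, handler URL, and kind are placeholders):

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class FanOut {
    public void enqueueAll() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Queue queue = QueueFactory.getQueue("process-entities"); // rate, concurrency, retries set in queue.xml

        // One fast keys-only pass over all entities; chunk size 500 instead of the small default.
        Query q = new Query("Item").setKeysOnly(); // "Item" is a placeholder kind
        for (Entity e : ds.prepare(q)
                .asIterable(FetchOptions.Builder.withChunkSize(500))) {
            // One task per entity; tasks run in parallel and retry on failure.
            queue.add(TaskOptions.Builder
                .withUrl("/tasks/process") // hypothetical handler that loads and processes one entity
                .param("key", KeyFactory.keyToString(e.getKey())));
        }
    }
}
```

In practice you might batch several keys per task to reduce task-creation overhead.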

1 vote

Depending on what you're doing, you have a few options:

  1. If you have a large amount of data, use a 'fan-out' model with task queues. In this model, task-queue jobs load a segment of data, process it, store a result, and possibly trigger more processing jobs. Task queue throttling lets you control throughput, duration, and cost, and also handles failure and retry. An advantage of this model is that you can test and rerun segments by poking URLs manually, and view progress in the admin panel.

  2. Use GAE MapReduce - https://cloud.google.com/appengine/docs/java/dataprocessing/

  3. In a single process, if you have a small amount of data. The drawback is the request deadline (60s, 10m, or 24hrs, depending on the type of server and request). Recall that Datastore operations can be asynchronous, so you can run requests in parallel from a single thread (see the sketch after this list), which may simplify your code. How many run before they become blocking is (I believe) controlled by max-concurrent-requests in your appengine-web.xml or app.yaml. This can be very expensive if your request can fail and isn't built to be resumable.
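
For option 3, here is a hedged sketch of the async pattern using the low-level AsyncDatastoreService (names are placeholders; option 1 would look much like the task-queue sketch in the previous answer):

```java
import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class AsyncLoad {
    public List<Entity> loadInParallel(List<Key> keys)
            throws ExecutionException, InterruptedException {
        AsyncDatastoreService ds = DatastoreServiceFactory.getAsyncDatastoreService();

        // Issue all gets up front; each call returns immediately with a Future.
        List<Future<Entity>> futures = new ArrayList<>();
        for (Key key : keys) {
            futures.add(ds.get(key));
        }

        // The RPCs run in parallel; we block only when collecting results.
        List<Entity> entities = new ArrayList<>();
        for (Future<Entity> f : futures) {
            entities.add(f.get());
        }
        return entities;
    }
}
```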

0 votes

Ed Davisson, a Google engineer who works on the Google Datastore Client API, answered this, giving the root cause of the problem and a recommended solution.

He says:

"The cursors returned by a query are only valid for use in the same query. When you switch from the keys-only query [In my Step 1, JF] to the non-keys-only query [In my Step 2, JF], the cursors are no longer applicable....

"If your goal is to split a result set into similar sized chunks, you might want to take a look at QuerySplitter [which is now in version 1beta3, JF]."