4
votes

I need to read all the entries in a Google AppEngine datastore to do some initialization work. There are a lot of entities (80k currently), and this number continues to grow. I'm starting to hit the 30-second datastore query timeout limit.

Are there any best practices for how to shard these types of huge reads in the datastore? Any examples?

2
Could you explain the use case? - Sebastian Kreft
I have a query that basically just does a scan of my datastore for entities of a particular kind. There are about 80k of them, and they take a long time to read, about 45 seconds. This exceeds the datastore read timeout, which means these table scans fail. I'm trying to understand how I can break up my reads into small chunks or otherwise push this to some longer-deadline type of processing so that my initialization won't fail. Also, the number of entities I have (80k today) is likely to grow, so I'd like this to work for 800k entities. @SebastianKreft - user1617999
Sounds like a job for a mapreduce. - Daniel Roseman
Without knowing more about what the data is and why you would want to query so much of it at once, all I can do is agree with @DanielRoseman that mapreduce tends to be a good tool for jobs of this size. With more information about the reasoning and purpose behind the query and the data, we may be able to provide better advice. - Bryce Cutt

2 Answers

3
votes

You can tackle this in several ways:

  1. Execute your code on the Task Queue, which has a 10-minute deadline instead of 30 seconds (more like 60 seconds in practice). The easiest way to do this is via a DeferredTask; a minimal sketch follows the list.

    Warning: a DeferredTask must be serializable, so it's hard to pass it complex data. Also, don't make it an inner class.

  2. See backends. Requests served by a backend instance have no time limit.

  3. Finally, if you need to break up a big task and execute it in parallel, look at mapreduce; see the mapper sketch after this list.
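
For option 1, here is a minimal sketch of a self-chaining deferred task (the Python counterpart of the Java API's DeferredTask lives in google.appengine.ext.deferred). MyModel, BATCH_SIZE, and init_entities are hypothetical names; the query cursor lets each task resume exactly where the previous one stopped instead of re-skipping entities:

from google.appengine.ext import db, deferred

class MyModel(db.Model):
    pass  # hypothetical stand-in for the kind being initialized

BATCH_SIZE = 500  # small enough that one batch finishes well within the deadline

def init_entities(cursor=None):
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)  # resume where the previous task left off
    batch = query.fetch(BATCH_SIZE)
    for entity in batch:
        pass  # ... per-entity initialization work goes here ...
    if len(batch) == BATCH_SIZE:
        # More entities may remain: chain a new task starting at the cursor.
        deferred.defer(init_entities, query.cursor())

# Start the chain from any request handler:
# deferred.defer(init_entities)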
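
For option 3, a sketch of what a mapper might look like with the GAE mapreduce library bundled into the app; the handler name process is hypothetical, and the mapper would be registered in mapreduce.yaml with a DatastoreInputReader pointed at the entity kind:

from mapreduce import operation as op

def process(entity):
    # ... per-entity initialization work goes here ...
    yield op.db.Put(entity)  # persist any changes made to the entity

The library shards the kind across many tasks, so each handler invocation only deals with a single entity and the whole job is parallelized and checkpointed for you.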

0
votes

This answer on StackExchange served me well:

Expired queries and appengine

I had to modify it slightly to make it work for me:

import logging

def loop_over_objects_in_batches(batch_size, object_class, callback):
    # count() caps at 1,000 by default, so pass an explicit high limit.
    num_els = object_class.all().count(1000000)
    num_loops = num_els // batch_size
    remainder = num_els - num_loops * batch_size
    logging.info("Calling batched loop with batch_size: %d, num_els: %s, num_loops: %s, remainder: %s, object_class: %s, callback: %s" % (batch_size, num_els, num_loops, remainder, object_class, callback))
    offset = 0
    while offset < num_loops * batch_size:
        logging.info("Processing batch (%d:%d)" % (offset, offset + batch_size))
        # fetch() with an offset replaces the original slice syntax,
        # which plain db.Model classes don't support.
        for entity in object_class.all().fetch(batch_size, offset=offset):
            callback(entity)
        offset += batch_size

    if remainder:
        logging.info("Processing remainder batch (%d:%d)" % (offset, num_els))
        for entity in object_class.all().fetch(remainder, offset=offset):
            callback(entity)
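
A call site might look like this (MyModel and the callback are hypothetical). Note that offset-based paging makes the datastore skip over offset entities on every batch, so for very large kinds a cursor-based loop like the deferred sketch in the other answer scales better:

def mark_initialized(entity):
    entity.initialized = True  # hypothetical initialization work
    entity.put()

loop_over_objects_in_batches(500, MyModel, mark_initialized)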