Highly variable performance on datastore and memcache operations (GAE)

Question

I am trying to optimize performance on GAE but once I deploy I get very unstable results. It's really hard to see if each optimization actually works because datastore and memcache operations take a very variable time (it ranges from milliseconds to seconds for the same operations).

For these tests I am the only user, making a single request to the application by refreshing the homepage. There is no other traffic (besides my own browser requesting images/css/js files from the page).

Edit: To make sure that the drops were not due to concurrent requests from the browser (images/css/js), I've redone the test by requesting ONLY the page with urllib2.urlopen(). Problem persists.

My questions are:

1) Is this something to expect due to the fact that machines/resources are shared?
2) What are the most common cases where this behavior can happen?
3) Where can I go from there?

Here is a very slow datastore get (memcache was just flushed): [image: Ultra slow datastore get]

Here is a very slow memcache get (things are cached because of the previous request): [image]

Here is a slow but faster memcache get (same repro step as the previous one, different calls are slow): [image]

Your datastore get is very slow! What is the query you are trying to run, and does it depend on a zig-zag join? — user1431972
It's just entities with various strings, integers, references (no blob IN the models). I can run the test once and get a 10ms query or get, run it again and have it take 7 seconds, and run it one more time and get 200ms. At least if it was consistently slow I'd know my query/data is bad. — Romz
Are you sure you have a running instance on the slow requests, and that you're not measuring startup time as well? — Tim Hoffman
Yes there is a running instance. When there is none, the log shows: This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application. - Also I doubt that a booting instance would show up in individual datastore/memcache operations in appstats. — Romz
Are you using NDB? If yes, did you try to do some of your datastore operations asynchronously? That way you would only depend on the latency of the slowest operation (instead of the cumulative latencies). Related: you might also want to take a look at proppy-appstats.appspot.com, which introduces different datastore optimization patterns. — proppy

Brent Washburne · Accepted Answer · 2013-07-12T21:26:27

To answer your questions,

1) yes, you can expect variance in remote calls because of the shared network;

2) the most common place you will see variance is in datastore requests -- the larger/further the request, the more variance you will see;

3) here are some options for you:

It looks like you are trying to fetch large amounts of data from the datastore/memcache. You may want to re-think the queries and caches so they retrieve smaller chunks of data. Does your app need all that data for a single request?

If the app really needs to process all that data on every request, another option is to preprocess it with a background task (cron, task queue, etc.) and put the results into memcache. The request that serves up the page should simply pick the right pieces out of the memcache and assemble the page.
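A minimal sketch of that precompute-then-assemble pattern, using only the Python 3 standard library. Everything named here is an assumption for illustration: `FakeCache` is an in-memory stand-in for memcache (not the App Engine client), and `build_homepage_fragments`, `refresh_cache`, `render_homepage` and the `homepage:` key prefix are hypothetical. Because cached entries can be evicted at any time, the request path falls back to rebuilding when a fragment is missing.

```python
class FakeCache:
    """In-memory stand-in for a memcache client (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None on a miss, like a typical cache client.
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def flush(self):
        self._data.clear()


def build_homepage_fragments():
    """Hypothetical expensive step (e.g. several datastore queries)."""
    return {
        "header": "<h1>Home</h1>",
        "top_posts": "<ul><li>First post</li></ul>",
    }


def refresh_cache(cache):
    """Meant to be run by a background job (cron or task queue)."""
    for name, html in build_homepage_fragments().items():
        cache.set("homepage:" + name, html)


def render_homepage(cache):
    """Request path: pick precomputed pieces out of the cache."""
    parts = []
    for name in ("header", "top_posts"):
        html = cache.get("homepage:" + name)
        if html is None:
            # Entry was evicted or never built: rebuild instead of failing.
            refresh_cache(cache)
            html = cache.get("homepage:" + name)
        parts.append(html)
    return "\n".join(parts)


if __name__ == "__main__":
    cache = FakeCache()
    refresh_cache(cache)           # background job
    print(render_homepage(cache))  # request handler
```

On App Engine the cache calls would go to the real memcache service and `refresh_cache` would be triggered by cron or a task queue handler; the structure of the request path stays the same.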

@proppy's suggestion to use NDB is a good one. It takes some work to rewrite serial queries into parallel ones, but the savings from async calls can be huge. If you can benefit from parallel tasks (using map), all the better.
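To illustrate why parallel calls help, here is a rough sketch using only the Python 3 standard library. It does not use NDB: a thread pool stands in for NDB's async machinery, and `fake_remote_get` with the made-up latencies in `SIMULATED_LATENCIES` stands in for datastore calls. The point is only that serial calls cost roughly the sum of the latencies, while parallel calls cost roughly the slowest one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-call latencies in seconds (not measured values).
SIMULATED_LATENCIES = {"user": 0.05, "posts": 0.10, "comments": 0.20}


def fake_remote_get(name):
    """Pretend to fetch `name` from a remote service by sleeping."""
    time.sleep(SIMULATED_LATENCIES[name])
    return name.upper()


def fetch_serial(names):
    """One call after another: total time is roughly the sum."""
    return [fake_remote_get(n) for n in names]


def fetch_parallel(names):
    """All calls in flight at once: total time is roughly the slowest."""
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        return list(pool.map(fake_remote_get, names))


def timed(fn, names):
    """Return (result, elapsed seconds) for fn(names)."""
    start = time.perf_counter()
    result = fn(names)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    names = ["user", "posts", "comments"]
    _, serial_time = timed(fetch_serial, names)
    _, parallel_time = timed(fetch_parallel, names)
    print("serial:   %.2fs" % serial_time)
    print("parallel: %.2fs" % parallel_time)
```

In NDB itself the same idea is expressed by starting several `*_async` calls (for example `key.get_async()` or `ndb.get_multi_async(keys)`) and only then calling `get_result()` on the returned futures.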
