14
votes

I'm a total App Engine newbie, and I want to confirm my understanding of the high replication datastore.

The documentation says that entity groups are a "unit of consistency", and that all data is eventually consistent. Along the same lines, it also says "queries across entity groups can be stale".

Can someone provide some examples where queries can be "stale"? Is it saying I could potentially save an entity without any parent (i.e. its own group), then query for it very soon after and not find it? Does it also imply that if I want data to be always 100% up-to-date I need to save them all in the same entity group?

Is the common workaround for this to use memcache to cache entities for a period of time longer than the average time it takes for data to become consistent across all data centers? What's the ballpark latency for that?

Thanks

3 Answers

18
votes

Is it saying I could potentially save an entity without any parent (i.e. its own group), then query for it very soon after and not find it?

Correct. Technically, this is the case for the regular Master-Slave datastore, too, as indexes are updated asynchronously, but in practice the window of time in which that could happen is so incredibly small you never see it.

If by "query" you mean "do a get by key", though, that will always return strongly consistent results in either implementation.
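As a rough illustration of that difference, here is a toy in-memory model. It is not how the datastore is implemented; `ToyDatastore` and `apply_pending` are made-up names, and the only assumption it encodes is the one above: entity writes are visible to a get by key immediately, while the index that queries read is updated later.

```python
class ToyDatastore:
    """Toy model: entity writes are immediate, index updates are deferred."""

    def __init__(self):
        self._entities = {}  # key -> entity, always current
        self._indexed = {}   # key -> entity as last seen by the index
        self._pending = []   # keys whose index rows are not yet updated

    def put(self, key, entity):
        # The entity itself is stored right away; the index update is queued.
        self._entities[key] = entity
        self._pending.append(key)

    def get(self, key):
        # Get by key reads the entity directly, so it always sees the latest write.
        return self._entities.get(key)

    def query(self, predicate):
        # Queries only see what has already reached the index.
        return [e for e in self._indexed.values() if predicate(e)]

    def apply_pending(self):
        # Stands in for the asynchronous index update catching up.
        for key in self._pending:
            self._indexed[key] = self._entities[key]
        self._pending = []


store = ToyDatastore()
store.put("post1", {"author": "bob", "article": "first article"})

by_key = store.get("post1")                          # sees the new entity
stale = store.query(lambda e: e["author"] == "bob")  # index has not caught up
store.apply_pending()
fresh = store.query(lambda e: e["author"] == "bob")  # now the query sees it
```

In this sketch `stale` is empty while `by_key` already holds the entity, which is the "save, then query and not find it" case from the question.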

Does it also imply that if I want data to be always 100% up-to-date I need to save them all in the same entity group?

You'll need to define what you mean by "100% up-to-date" before it's possible to answer that.

Is the common workaround for this to use memcache to cache entities for a period of time longer than the average time it takes for data to become consistent across all data centers?

No. Memcache is strictly for improving access times; you shouldn't use it in any situation where cache eviction will cause trouble.

Strongly consistent gets are always available to you if you need to guarantee that you're seeing the latest version. Without a concrete example of what you're trying to do, though, it's difficult to provide a recommendation.

11
votes

Obligatory blog example setup: Authors have Posts.

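from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Post(db.Model):
    # Reference to the Author entity that wrote this post
    author = db.ReferenceProperty(Author)
    article = db.TextProperty()

# No parent given, so bob is the root of his own entity group
bob = Author(name='bob')
bob.put()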

The first thing to remember is that a regular get/put/delete on a single entity group (including a single entity) works as expected:

post1 = Post(article='first article', author=bob)
post1.put()

fetched_post = Post.get(post1.key())
# fetched_post is latest post1

You will only notice inconsistency once you start querying across multiple entity groups. Unless you have specified a parent attribute, all your entities are in separate entity groups. So if it is important that bob can see his own post straight after creating it, we should be careful with the following:

fetched_posts = Post.all().filter('author =', bob).fetch(x)
# fetched_posts _might_ contain latest post1

fetched_posts might contain the latest post1 from bob, but it might not. This is because the Posts are not all in the same entity group. When querying like this in HR you should think "fetch me probably the latest posts for bob".

Since it is important in our application that the author can see his post in the list straight after creating it, we will use the parent attribute to tie them together, and use an ancestor query to fetch the posts only from within that group:

post2 = Post(parent=bob, article='second article', author=bob)
post2.put()

bobs_posts = Post.all().ancestor(bob.key()).filter('author =', bob).fetch(x)

Now we know that post2 will be in our bobs_posts results.

If the aim of our query was to fetch "probably all the latest posts + definitely latest posts by bob" we would need to do another query.

other_posts = Post.all().fetch(x)

Then merge the results other_posts and bobs_posts together to get the desired result.
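A minimal sketch of that merge step, using plain dicts in place of datastore entities so it runs anywhere. `merge_posts` and the `"key"` field are illustrative names, not part of the db API; with real entities you would presumably pass a function that returns the entity's key instead.

```python
def merge_posts(definitely_latest, probably_latest, key=lambda post: post["key"]):
    """Combine two result lists, dropping duplicates by key.

    Items from definitely_latest come first and win when both lists
    contain the same key.
    """
    seen = set()
    merged = []
    for post in list(definitely_latest) + list(probably_latest):
        k = key(post)
        if k not in seen:
            seen.add(k)
            merged.append(post)
    return merged


# Stand-ins for the ancestor query result and the global query result.
bobs_posts = [{"key": "post2", "author": "bob"}]
other_posts = [
    {"key": "post1", "author": "bob"},
    {"key": "post2", "author": "bob"},
    {"key": "post3", "author": "alice"},
]

merged = merge_posts(bobs_posts, other_posts)
```

Putting the ancestor-query results first means the strongly consistent copy of a post is the one that survives if the global query also returned it.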

5
votes

Having just migrated my app over from the Master/Slave to the High Replication datastore, I have to say that in practice, eventual consistency isn't a problem for most applications.

Consider the classic guestbook example, where you put() a new guestbook post Entity and then immediately query all the posts in the guestbook. With the High Replication datastore, you won't see the new post appear in the query results until a few seconds later (at Google I/O, the Google engineers said that the lag was on the order of 2-5 seconds).

Now, in practice, your guestbook app is probably doing an AJAX post of the new guestbook post entry. There is no need to refetch all the posts after submitting the new post. The webapp can simply insert the new entry into the UI once the AJAX request has succeeded. By the time the user leaves the webpage and returns to it, or even hits the browser refresh button, several seconds will have elapsed, and it is very likely that the new post will be returned by the query that pulls in all the guestbook posts.
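The same idea can be expressed in a few lines of Python: whoever builds the list to display already has the entry that was just saved, so it can be added by hand if the query has not caught up yet. This is only a sketch; `posts_to_display` and the `"id"` field are hypothetical names, and plain dicts stand in for guestbook entities.

```python
def posts_to_display(query_results, just_saved, key=lambda post: post["id"]):
    """Return the query results, with the just-saved post prepended if the query missed it."""
    if any(key(p) == key(just_saved) for p in query_results):
        # The query already caught up; nothing to add.
        return list(query_results)
    return [just_saved] + list(query_results)


new_post = {"id": 3, "text": "hello"}
older = [{"id": 2, "text": "second"}, {"id": 1, "text": "first"}]

# Query ran before the new post became visible.
shown_when_stale = posts_to_display(older, new_post)

# Query ran after the new post became visible.
shown_when_fresh = posts_to_display([new_post] + older, new_post)
```

Either way the user sees the new entry exactly once.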

Finally, note that eventual consistency applies only to queries. If you put() an entity and immediately call db.get() to fetch it back, the result is strongly consistent, i.e. you will get the latest snapshot of the entity.