1
votes

I am designing google datastore schema for the classic 'User Posts' and 'Tags' questions.

This page suggests Relation Index Entities model. Basically it puts searchable tags or keywords as list property in child entity for filtering, and the necessary properties in parent entity. To my understanding, this approach is to reduce serialization overhead at query time.

class Post(db.Model):
  title = db.StringProperty()
  post_date = db.DateTimeProperty()

class Tags(db.Model):
  tags = db.StringListProperty()

mytags = Tags(parent=post, tags=many_tags)
  1. Given projection queries can get a subset of properties, is Relation Index Entities still necessary to reduce serialization overhead of list properties?

Note: projection query has limits; Relation Index Entity doesn't.

  1. Does Relation Index Entities make query more difficult? Saying I want to filter on the post with tag 'cars' for the posts created within last 7 days. tags and post_date are in different kinds, is there an easy way to do that?

  2. Regarding exploding indexes, does Relation Index Entities reduce the chance of exploding indexes, since it put list properties in different kinds?

Thanks for answering in advance.

1
Side note: from cloud.google.com/appengine/docs/standard/python/datastore/…: "Note: Developers building new applications are strongly encouraged to use the NDB Client Library, which has several benefits compared to this client library, such as automatic entity caching via the Memcache API. If you are currently using the older DB Client Library, read the DB to NDB Migration Guide"Dan Cornilescu

1 Answers

1
votes
  1. The Relation Index Entity solution reduces the serialization overhead at any type of access to the Post entities, including ops like key.get(), `entity.put() or fetching non-projection queries, while projection queries only do that for, well, fetching the respective query results.

  2. Yes, queries are a bit more difficult. For your example you'll need separate queries, one for each entity kind.

The example assumes using ndb, not db:

from google.appengine.ext import ndb

class Post(ndb.Model):
  title = ndb.StringProperty()
  post_date = ndb.DateTimeProperty()

class Tags(ndb.Model):
  tags = ndb.StringProperty(repeated=True)

I'd use keys-only queries as they're cheaper and faster:

from datetime import datetime, timedelta

car_post_keys = []
post_keys = Post.query(Post.post_date>(datetime.utcnow() - timedelta(days=7)),
                       keys_only=True).fetch()
if post_keys is not None:
    for post_key in post_keys:
        if Tags.query(Tags.tags=='car', ancestor=post_key, keys_only=True).fetch() is not None:
            car_post_keys.append(post_key)

car_posts = ndb.get_multi(car_post_keys) if car_post_keys else []
  1. In general the answer would be yes, exactly for the reason you mention. In your particular example there is only one property with multiple values - tags - and a small number of other Post properties, all with single values, so the difference in exploding index impact would probably be neglijible.

Splitting an entity in several smaller ones is a common technique used for other reasons as well, see, for example, re-using an entity's ID for other entities of different kinds - sane idea?.

Here's an example of applying this idea here:

car_post_key_ids = []

post_keys = Post.query(Post.post_date>(datetime.utcnow() - timedelta(days=7)),
                       keys_only=True).fetch()
if post_keys is not None:
    post_key_ids = [key.id() for key in post_keys]

    car_tag_keys = Tags.query(Tags.tags=='car', keys_only=True).fetch()
    car_tag_key_ids = [key.id() for key in car_tag_keys] if car_tag_keys is not None else []

    car_post_key_ids = list(set(post_key_ids) & set(car_tag_key_ids))

car_posts = [Post.get_by_id(id) for id in car_post_key_ids]

The examples are rather simplistic, they can be optimized using ndb async calls, tasks/tasklets, cursors may be needed for many results, etc.